**osd-text-extractor 0.1.1**

- **Summary**: A Python library for extracting plain text from various document formats for LLM and NLP purposes
- **Uploaded**: 2025-08-28 06:08:05
- **Requires Python**: >=3.12
- **License**: MIT (Copyright (c) 2025 OneSlap Team)
- **Keywords**: text-extraction, pdf, docx, xlsx, html, llm, nlp
# OSD Text Extractor

A Python library for extracting plain text from various document formats for LLM and NLP purposes.

## Features

- **Multi-format support**: Extract text from PDF, DOCX, XLSX, HTML, XML, JSON, Markdown, RTF, CSV, EPUB, FB2, ODS, ODT, and TXT files
- **Clean output**: Automatically removes non-Latin characters, normalizes whitespace, and filters out formatting artifacts
- **LLM-ready**: Produces clean, plain text optimized for language model processing
- **Robust error handling**: Comprehensive exception handling with detailed error messages
- **Memory efficient**: Handles large files with appropriate size limits and safeguards
- **Type safe**: Full type hints and mypy compliance

## Installation

```bash
pip install osd-text-extractor
```

## Quick Start

```python
from osd_text_extractor import extract_text

# Extract text from a file
with open("document.pdf", "rb") as f:
    content = f.read()

text = extract_text(content, "pdf")
print(text)
```

## Supported Formats

| Format | Extension | Description |
|--------|-----------|-------------|
| PDF | `.pdf` | Portable Document Format |
| DOCX | `.docx` | Microsoft Word documents |
| XLSX | `.xlsx` | Microsoft Excel spreadsheets |
| HTML | `.html`, `.htm` | Web pages |
| XML | `.xml` | XML documents |
| JSON | `.json` | JSON data files |
| Markdown | `.md` | Markdown documents |
| RTF | `.rtf` | Rich Text Format |
| CSV | `.csv` | Comma-separated values |
| TXT | `.txt` | Plain text files |
| EPUB | `.epub` | Electronic books |
| FB2 | `.fb2` | FictionBook format |
| ODS | `.ods` | OpenDocument Spreadsheet |
| ODT | `.odt` | OpenDocument Text |
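
The table above pairs each extension with the format name `extract_text` expects. A small lookup helper (a sketch, not part of the library's API) can translate a filename into that format string, including the `.htm` → `html` case:

```python
from pathlib import Path

# Extensions from the supported-formats table, mapped to the format
# names passed to extract_text. Both .html and .htm map to "html".
EXTENSION_TO_FORMAT = {
    ".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx",
    ".html": "html", ".htm": "html", ".xml": "xml",
    ".json": "json", ".md": "md", ".rtf": "rtf",
    ".csv": "csv", ".txt": "txt", ".epub": "epub",
    ".fb2": "fb2", ".ods": "ods", ".odt": "odt",
}

def format_for(path):
    """Return the format name for a path, or None if unsupported."""
    return EXTENSION_TO_FORMAT.get(Path(path).suffix.lower())
```

This keeps extension handling in one place instead of slicing `suffix` at every call site.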

## Usage Examples

### Basic Text Extraction

```python
from osd_text_extractor import extract_text

# PDF extraction
with open("report.pdf", "rb") as f:
    pdf_text = extract_text(f.read(), "pdf")

# HTML extraction
html_content = b"<html><body><h1>Title</h1><p>Content</p></body></html>"
html_text = extract_text(html_content, "html")

# JSON extraction
json_content = b'{"title": "Document", "content": "Text content"}'
json_text = extract_text(json_content, "json")
```

### Working with Different File Types

```python
import os
from osd_text_extractor import extract_text

def extract_from_file(file_path):
    # Get file extension
    _, ext = os.path.splitext(file_path)
    format_name = ext[1:].lower()  # Remove dot and lowercase

    # Read file content
    with open(file_path, "rb") as f:
        content = f.read()

    # Extract text
    try:
        text = extract_text(content, format_name)
        return text
    except Exception as e:
        print(f"Failed to extract text from {file_path}: {e}")
        return None

# Usage
text = extract_from_file("document.docx")
if text:
    print(f"Extracted {len(text)} characters")
```

### Batch Processing

```python
from pathlib import Path
from osd_text_extractor import extract_text

def process_directory(directory_path, output_file):
    supported_extensions = {
        '.pdf', '.docx', '.xlsx', '.html', '.htm', '.xml', '.json',
        '.md', '.rtf', '.csv', '.txt', '.epub', '.fb2', '.ods', '.odt',
    }

    results = []

    for file_path in Path(directory_path).rglob('*'):
        if file_path.suffix.lower() in supported_extensions:
            try:
                with open(file_path, 'rb') as f:
                    content = f.read()

                fmt = file_path.suffix[1:].lower()
                format_name = 'html' if fmt == 'htm' else fmt  # .htm uses the html extractor
                text = extract_text(content, format_name)

                results.append({
                    'file': str(file_path),
                    'text': text,
                    'length': len(text)
                })
                print(f"✓ Processed {file_path}")

            except Exception as e:
                print(f"✗ Failed {file_path}: {e}")

    # Save results
    with open(output_file, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(f"=== {result['file']} ===\n")
            f.write(f"{result['text']}\n\n")

    print(f"Processed {len(results)} files, saved to {output_file}")

# Usage
process_directory("./documents", "extracted_texts.txt")
```

## Text Cleaning

The library automatically cleans extracted text:

- **Character filtering**: Removes non-Latin characters (Cyrillic, Chinese, Arabic, emojis, etc.)
- **Whitespace normalization**: Collapses multiple spaces, tabs, and line breaks
- **Artifact removal**: Strips HTML tags, markdown syntax, and formatting codes
- **Emoji removal**: Filters out emoji characters

### Text cleaning example

```python
# Input text with mixed content
raw_text = "English text Русский 中文 with symbols @#$% and emojis 🌍"

# After extraction and cleaning
cleaned_text = "English text with symbols and emojis"
```
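
The steps above can be approximated with a short stdlib sketch. This is an illustration of the script-filtering and whitespace-collapsing behavior, not the library's actual pipeline, and unlike the example above it keeps ASCII punctuation such as `@#$%`:

```python
import re

def approximate_clean(text):
    """Rough approximation of the cleaning described above:
    replace characters outside the printable ASCII range (non-Latin
    scripts, emoji, etc.) with spaces, then collapse whitespace."""
    ascii_only = re.sub(r"[^\x20-\x7E\n]", " ", text)
    return re.sub(r"\s+", " ", ascii_only).strip()
```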

## Error Handling

The library provides specific exceptions for different error scenarios:

```python
from osd_text_extractor import extract_text
from osd_text_extractor.application.exceptions import UnsupportedFormatError
from osd_text_extractor.domain.exceptions import TextLengthError
from osd_text_extractor.infrastructure.exceptions import ExtractionError

try:
    text = extract_text(content, format_name)
except UnsupportedFormatError:
    print("File format not supported")
except TextLengthError:
    print("No valid text content found")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```

## Security Features

The library includes several security protections:

- **Size limits**: Prevents processing of excessively large files
- **XML bomb protection**: Guards against malicious XML with excessive nesting or entity expansion
- **Memory safeguards**: Limits memory usage during processing
- **Input validation**: Validates file formats and content structure
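
The library enforces its own internal limits, but a caller-side guard can reject oversized files before they are ever read into memory. The function and the 50 MB ceiling below are assumptions for illustration, not part of the library:

```python
import os

MAX_BYTES = 50 * 1024 * 1024  # assumed ceiling; tune to your environment

def read_bounded(path, limit=MAX_BYTES):
    """Read a file's bytes, refusing files larger than `limit`."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, over the {limit}-byte limit")
    with open(path, "rb") as f:
        return f.read()
```

Checking the size first avoids allocating the buffer at all for files the extractor would reject anyway.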

## Performance Considerations

- **Memory usage**: Files are processed entirely in memory; keep available RAM in mind for large files
- **Processing speed**: Varies by format complexity (roughly fastest to slowest: TXT, HTML, PDF, DOCX)
- **Concurrent processing**: The library is thread-safe, so documents can be extracted from multiple threads
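
Since the library is documented as thread-safe, a thread pool is a natural way to extract several documents at once. The sketch below uses a stand-in `extract_text` stub so it is self-contained; in real use, import `extract_text` from `osd_text_extractor` instead:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for osd_text_extractor.extract_text so this sketch runs
# on its own; replace with the real import in practice.
def extract_text(content, fmt):
    return content.decode("utf-8", errors="ignore")

def extract_many(payloads, max_workers=4):
    """Extract a list of (content_bytes, format_name) pairs concurrently.

    pool.map preserves input order, so results line up with payloads.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: extract_text(*p), payloads))
```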

## Dependencies

Core dependencies:
- `beautifulsoup4` - HTML/XML parsing
- `lxml` - XML processing
- `pymupdf` - PDF processing
- `python-docx` - DOCX processing
- `openpyxl` - XLSX processing
- `striprtf` - RTF processing
- `odfpy` - ODS/ODT processing
- `emoji` - Emoji handling
- `dishka` - Dependency injection

## Development

### Setting up development environment

```bash
# Clone repository
git clone https://github.com/OneSlap/osd-text-extractor.git
cd osd-text-extractor

# Install UV (package manager)
pip install uv

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/
uv run ruff format src/ tests/

# Run type checking
uv run mypy src/
```

### Running tests

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src/osd_text_extractor --cov-report=html

# Run specific test file
uv run pytest tests/unit/test_domain/test_domain_entities.py

# Run integration tests only
uv run pytest tests/integration/
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Run the test suite (`uv run pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## Changelog

### v0.1.0
- Initial release
- Support for 14 document formats
- Clean architecture with dependency injection
- Comprehensive test suite
- Type safety with mypy
- Security protections for XML processing

## Support

- **Issues**: [GitHub Issues](https://github.com/OneSlap/osd-text-extractor/issues)
- **Documentation**: [GitHub README](https://github.com/OneSlap/osd-text-extractor#readme)
- **Source Code**: [GitHub Repository](https://github.com/OneSlap/osd-text-extractor)

## Roadmap

- [ ] Add support for PowerPoint (PPTX) files
- [ ] Implement streaming processing for very large files
- [ ] Add OCR support for image-based PDFs
- [ ] Improve text structure preservation
- [ ] Add configuration options for text cleaning
- [ ] Performance optimizations for batch processing

            
