# DocProcessor
A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [API Reference](#api-reference)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Examples](#examples)
- [Development](#development)
- [Contributing](#contributing)
- [Changelog](#changelog)
- [License](#license)
- [Support](#support)
- [Citation](#citation)
## Features
- **Multi-format Support**: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
- **Intelligent OCR**: Layout-aware PDF text extraction with OCR fallback for images
- **Semantic Chunking**: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
- **LLM Summarization**: Generate concise document summaries, with a fallback when no LLM client is configured
- **Meilisearch Integration**: Built-in support for indexing to Meilisearch
- **Flexible API**: Use components individually or as a unified pipeline
## Installation
### From PyPI
```bash
pip install docprocessor
```
### From GitHub
```bash
pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git
```
### For Development
```bash
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"
```
### System Dependencies
For OCR functionality, install the following system packages:
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr poppler-utils
```
**macOS:**
```bash
brew install tesseract poppler
```
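To confirm the toolchain is visible from Python, pytesseract can report the installed Tesseract version (Poppler itself is only exercised when a PDF is actually converted):

```python
# Sanity check: raises if the tesseract binary is not on PATH.
import pytesseract

print(pytesseract.get_tesseract_version())
```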
## Quick Start
### Basic Usage
```python
from docprocessor import DocumentProcessor
# Initialize processor
processor = DocumentProcessor()
# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False  # Requires LLM client
)
print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")
```
### With LLM Summarization
```python
from docprocessor import DocumentProcessor
# Your LLM client (must have a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500
)

result = processor.process(
    file_path="document.pdf",
    summarize=True
)
print(f"Summary: {result.summary}")
```
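As a concrete illustration, the same protocol can wrap the OpenAI Python SDK. This is only a sketch — the `openai` package, the `OpenAIChatClient` name, and the model choice are assumptions, not docprocessor dependencies:

```python
# Hypothetical adapter around the OpenAI SDK (not part of docprocessor).
from openai import OpenAI

class OpenAIChatClient:
    def __init__(self, model="gpt-4o-mini"):  # model choice is an assumption
        self.model = model
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def complete_chat(self, messages, temperature):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
        )
        return {"content": response.choices[0].message.content}
```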
### With Meilisearch Indexing
```python
from docprocessor import DocumentProcessor, MeiliSearchIndexer
# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)
# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_"  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks"
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10
)
```
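The shape of the object returned by `search()` depends on the wrapper. Assuming it passes through the raw Meilisearch response — a dict with a `hits` list, which you should verify against your installed version — results can be consumed like this:

```python
# Assumes search() returns the raw Meilisearch response dict (an assumption).
for hit in results.get("hits", []):
    print(hit.get("filename"), (hit.get("chunk_text") or "")[:80])
```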
## Advanced Usage
### Custom Chunking Parameters
```python
processor = DocumentProcessor(
    chunk_size=1024,     # Larger chunks
    chunk_overlap=100,   # More overlap
    min_chunk_size=200   # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt"
)
```
### Extract Text Only
```python
processor = DocumentProcessor()
extraction = processor.extract_text("document.pdf")
print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")
```
### Multi-Environment Indexing
```python
# Index to multiple environments
# `search_docs` comes from chunks_to_search_documents(), as in the
# Meilisearch example above.
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_"
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_"
    }
}

for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"]
    )
    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")
```
## API Reference
### DocumentProcessor
Main class for document processing.
**Parameters:**
- `ocr_enabled` (bool): Enable OCR for PDFs/images. Default: `True`
- `chunk_size` (int): Target chunk size in tokens. Default: `512`
- `chunk_overlap` (int): Overlap between chunks, in tokens. Default: `50`
- `min_chunk_size` (int): Minimum chunk size, in tokens. Default: `100`
- `summary_target_words` (int): Target summary length, in words. Default: `500`
- `llm_client` (Optional[Any]): LLM client for summarization. Default: `None`
- `llm_temperature` (float): LLM temperature. Default: `0.3`
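Putting the parameters above together, a fully configured processor looks like this (the values shown are the documented defaults):

```python
processor = DocumentProcessor(
    ocr_enabled=True,           # OCR fallback for PDFs and images
    chunk_size=512,             # target tokens per chunk
    chunk_overlap=50,
    min_chunk_size=100,
    summary_target_words=500,
    llm_client=None,            # supply a client to enable summarization
    llm_temperature=0.3,
)
```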
**Methods:**
- `process()`: Full pipeline (extract, chunk, summarize)
- `extract_text()`: Extract text from document
- `chunk_text()`: Chunk text into segments
- `summarize_text()`: Generate summary
- `chunks_to_search_documents()`: Convert chunks for indexing
### MeiliSearchIndexer
Interface for Meilisearch operations.
**Parameters:**
- `url` (str): Meilisearch server URL
- `api_key` (str): Meilisearch API key
- `index_prefix` (Optional[str]): Prefix for index names
**Methods:**
- `index_chunks()`: Index multiple documents
- `index_document()`: Index single document
- `search()`: Search an index
- `delete_document()`: Delete by ID
- `delete_documents_by_filter()`: Delete by filter
- `create_index()`: Create new index
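A sketch of basic index management built on these methods — the argument names for `create_index()`, `index_document()`, and `delete_document()` are assumptions, so check the signatures in your installed version:

```python
indexer = MeiliSearchIndexer(url="http://localhost:7700", api_key="your_master_key")

# Argument names below are assumptions; verify against the actual signatures.
indexer.create_index("document_chunks")
indexer.index_document({"chunk_id": "abc-1", "chunk_text": "hello"}, index_name="document_chunks")
indexer.delete_document("abc-1", index_name="document_chunks")
```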
### DocumentChunk
Data class representing a text chunk.
**Attributes:**
- `chunk_id` (str): Unique identifier
- `file_id` (str): Source file identifier
- `output_id` (str): Output identifier
- `project_id` (int): Project identifier
- `filename` (str): Source filename
- `chunk_number` (int): Chunk sequence number
- `total_chunks` (int): Total chunks in document
- `chunk_text` (str): The chunk text content
- `token_count` (int): Number of tokens
- `pages` (List[int]): Page numbers (for PDFs)
- `metadata` (Dict): Additional metadata
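Chunks returned by `chunk_text()` (and by `process()` with `chunk=True`) expose these attributes directly:

```python
chunks = processor.chunk_text(text="Your long document text here...", filename="notes.txt")

for c in chunks:
    print(f"{c.filename} [{c.chunk_number}/{c.total_chunks}] "
          f"{c.token_count} tokens, pages={c.pages}")
```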
## Architecture
DocProcessor consists of several independent components:
1. **ContentExtractor**: Extracts text from various file formats
2. **DocumentChunker**: Splits text into semantic segments
3. **DocumentSummarizer**: Generates LLM-based summaries
4. **MeiliSearchIndexer**: Indexes documents to Meilisearch
Each component can be used independently or through the unified `DocumentProcessor` API.
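For example, the extractor can be used on its own. The import below assumes `ContentExtractor` is exported at the package top level and mirrors `DocumentProcessor.extract_text()` — verify the exact path and method name in your version:

```python
# Assumes a top-level export and an extract_text() method; adjust if needed.
from docprocessor import ContentExtractor

extractor = ContentExtractor()
extraction = extractor.extract_text("report.docx")
print(extraction["text"][:200])
```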
## Requirements
**Python**: 3.10+ (tested on 3.10, 3.11, 3.12)
**Core Dependencies:**
- pdfminer.six - PDF text extraction
- pdf2image - PDF to image conversion
- pytesseract - OCR engine
- opencv-python - Image preprocessing
- Pillow - Image handling
- python-docx - DOCX extraction
- python-pptx - PPTX extraction
- langchain-text-splitters - Semantic chunking
- tiktoken - Token counting
**Optional:**
- meilisearch - Search engine integration
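Because chunk sizes are measured in tokens, tiktoken does the counting. A minimal illustration — the encoding name here is an assumption; docprocessor may use a different one internally:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
print(len(enc.encode("How many tokens is this sentence?")))
```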
## Examples
See the `examples/` directory for more usage examples:
- `basic_usage.py` - Simple document processing
- `multi_environment.py` - Indexing to multiple environments
- `custom_chunking.py` - Advanced chunking options
## Development
### Using GitHub Codespaces (Recommended)
The easiest way to start developing:
1. Click the **Code** button on GitHub
2. Select **Codespaces** → **Create codespace on main**
3. Wait for the environment to build (includes all dependencies)
4. Start coding!
The devcontainer automatically installs:
- Python 3.11
- All system dependencies (Tesseract, Poppler)
- Python dependencies in editable mode
- Pre-commit hooks
- VS Code extensions (Black, isort, flake8, etc.)
### Local Development Setup
```bash
# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
# Install Python dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run tests with coverage
pytest --cov=docprocessor
```
### Code Quality
We use automated tools to maintain code quality:
```bash
# Format code
black docprocessor tests
# Sort imports
isort docprocessor tests
# Lint
flake8 docprocessor tests
# Type check
mypy docprocessor
# Or run all checks with pre-commit
pre-commit run --all-files
```
### Running Tests
```bash
# Run all tests
pytest
# With coverage report
pytest --cov=docprocessor --cov-report=html
# Run specific test file
pytest tests/test_processor.py -v
# Run tests matching pattern
pytest -k "test_extract" -v
```
## Contributing
We love contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
- Issue reporting
Quick tips:
- Use the devcontainer for a consistent environment
- Write tests for new features
- Follow PEP 8 and use pre-commit hooks
- Update documentation for API changes
- Add entries to CHANGELOG.md
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for version history and release notes.
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Support
- **Issues**: [GitHub Issues](https://github.com/Knowledge-Innovation-Centre/doc-processor/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Knowledge-Innovation-Centre/doc-processor/discussions)
- **Email**: info@knowledgeinnovation.eu
## Citation
If you use docprocessor in your research or project, please cite:
```bibtex
@software{docprocessor2025,
  title  = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year   = {2025},
  url    = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}
```
---
Made with ❤️ by [Knowledge Innovation Centre](https://knowledgeinnovation.eu)