# DocProcessor
A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Advanced Usage](#advanced-usage)
- [API Reference](#api-reference)
- [Architecture](#architecture)
- [Requirements](#requirements)
- [Examples](#examples)
- [Development](#development)
- [Contributing](#contributing)
- [Changelog](#changelog)
- [License](#license)
- [Support](#support)
- [Citation](#citation)
## Features
- **Multi-format Support**: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
- **Intelligent OCR**: Layout-aware PDF text extraction with OCR fallback for images
- **Semantic Chunking**: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
- **LLM Summarization**: Generate concise document summaries, with a fallback when no LLM client is configured
- **Meilisearch Integration**: Built-in support for indexing to Meilisearch
- **Flexible API**: Use components individually or as a unified pipeline
## Installation
### From PyPI
```bash
pip install docprocessor
```
### From GitHub
```bash
pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git
```
### For Development
```bash
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"
```
### System Dependencies
For OCR functionality, install the following system packages:
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr poppler-utils
```
**macOS:**
```bash
brew install tesseract poppler
```
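To confirm the toolchain is visible from Python, pytesseract can report the installed Tesseract version (Poppler itself is only exercised when a PDF is actually converted):

```python
# Sanity check: raises if the tesseract binary is not on PATH.
import pytesseract

print(pytesseract.get_tesseract_version())
```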
## Quick Start
### Basic Usage
```python
from docprocessor import DocumentProcessor
# Initialize processor
processor = DocumentProcessor()
# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False  # Requires LLM client
)
print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")
```
### With LLM Summarization
```python
from docprocessor import DocumentProcessor
# Your LLM client (must have a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500
)

result = processor.process(
    file_path="document.pdf",
    summarize=True
)
print(f"Summary: {result.summary}")
```
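As a concrete illustration, the same protocol can wrap the OpenAI Python SDK. This is only a sketch — the `openai` package, the `OpenAIChatClient` name, and the model choice are assumptions, not docprocessor dependencies:

```python
# Hypothetical adapter around the OpenAI SDK (not part of docprocessor).
from openai import OpenAI

class OpenAIChatClient:
    def __init__(self, model="gpt-4o-mini"):  # model choice is an assumption
        self.model = model
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def complete_chat(self, messages, temperature):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=temperature,
        )
        return {"content": response.choices[0].message.content}
```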
### With Meilisearch Indexing
```python
from docprocessor import DocumentProcessor, MeiliSearchIndexer
# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)
# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_"  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks"
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10
)
```
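The shape of the object returned by `search()` depends on the wrapper. Assuming it passes through the raw Meilisearch response — a dict with a `hits` list, which you should verify against your installed version — results can be consumed like this:

```python
# Assumes search() returns the raw Meilisearch response dict (an assumption).
for hit in results.get("hits", []):
    print(hit.get("filename"), (hit.get("chunk_text") or "")[:80])
```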
## Advanced Usage
### Custom Chunking Parameters
```python
processor = DocumentProcessor(
    chunk_size=1024,     # Larger chunks
    chunk_overlap=100,   # More overlap
    min_chunk_size=200   # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt"
)
```
### Extract Text Only
```python
processor = DocumentProcessor()
extraction = processor.extract_text("document.pdf")
print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")
```
### Multi-Environment Indexing
```python
# Index to multiple environments
# `search_docs` comes from chunks_to_search_documents(), as in the
# Meilisearch example above.
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_"
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_"
    }
}

for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"]
    )
    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")
```
## API Reference
### DocumentProcessor
Main class for document processing.
**Parameters:**
- `ocr_enabled` (bool): Enable OCR for PDFs/images. Default: `True`
- `chunk_size` (int): Target chunk size in tokens. Default: `512`
- `chunk_overlap` (int): Overlap between chunks, in tokens. Default: `50`
- `min_chunk_size` (int): Minimum chunk size, in tokens. Default: `100`
- `summary_target_words` (int): Target summary length, in words. Default: `500`
- `llm_client` (Optional[Any]): LLM client for summarization. Default: `None`
- `llm_temperature` (float): LLM temperature. Default: `0.3`
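Putting the parameters above together, a fully configured processor looks like this (the values shown are the documented defaults):

```python
processor = DocumentProcessor(
    ocr_enabled=True,           # OCR fallback for PDFs and images
    chunk_size=512,             # target tokens per chunk
    chunk_overlap=50,
    min_chunk_size=100,
    summary_target_words=500,
    llm_client=None,            # supply a client to enable summarization
    llm_temperature=0.3,
)
```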
**Methods:**
- `process()`: Full pipeline (extract, chunk, summarize)
- `extract_text()`: Extract text from document
- `chunk_text()`: Chunk text into segments
- `summarize_text()`: Generate summary
- `chunks_to_search_documents()`: Convert chunks for indexing
### MeiliSearchIndexer
Interface for Meilisearch operations.
**Parameters:**
- `url` (str): Meilisearch server URL
- `api_key` (str): Meilisearch API key
- `index_prefix` (Optional[str]): Prefix for index names
**Methods:**
- `index_chunks()`: Index multiple documents
- `index_document()`: Index single document
- `search()`: Search an index
- `delete_document()`: Delete by ID
- `delete_documents_by_filter()`: Delete by filter
- `create_index()`: Create new index
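A sketch of basic index management built on these methods — the argument names for `create_index()`, `index_document()`, and `delete_document()` are assumptions, so check the signatures in your installed version:

```python
indexer = MeiliSearchIndexer(url="http://localhost:7700", api_key="your_master_key")

# Argument names below are assumptions; verify against the actual signatures.
indexer.create_index("document_chunks")
indexer.index_document({"chunk_id": "abc-1", "chunk_text": "hello"}, index_name="document_chunks")
indexer.delete_document("abc-1", index_name="document_chunks")
```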
### DocumentChunk
Data class representing a text chunk.
**Attributes:**
- `chunk_id` (str): Unique identifier
- `file_id` (str): Source file identifier
- `output_id` (str): Output identifier
- `project_id` (int): Project identifier
- `filename` (str): Source filename
- `chunk_number` (int): Chunk sequence number
- `total_chunks` (int): Total chunks in document
- `chunk_text` (str): The chunk text content
- `token_count` (int): Number of tokens
- `pages` (List[int]): Page numbers (for PDFs)
- `metadata` (Dict): Additional metadata
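Chunks returned by `chunk_text()` (and by `process()` with `chunk=True`) expose these attributes directly:

```python
chunks = processor.chunk_text(text="Your long document text here...", filename="notes.txt")

for c in chunks:
    print(f"{c.filename} [{c.chunk_number}/{c.total_chunks}] "
          f"{c.token_count} tokens, pages={c.pages}")
```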
## Architecture
DocProcessor consists of several independent components:
1. **ContentExtractor**: Extracts text from various file formats
2. **DocumentChunker**: Splits text into semantic segments
3. **DocumentSummarizer**: Generates LLM-based summaries
4. **MeiliSearchIndexer**: Indexes documents to Meilisearch
Each component can be used independently or through the unified `DocumentProcessor` API.
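For example, the extractor can be used on its own. The import below assumes `ContentExtractor` is exported at the package top level and mirrors `DocumentProcessor.extract_text()` — verify the exact path and method name in your version:

```python
# Assumes a top-level export and an extract_text() method; adjust if needed.
from docprocessor import ContentExtractor

extractor = ContentExtractor()
extraction = extractor.extract_text("report.docx")
print(extraction["text"][:200])
```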
## Requirements
**Python**: 3.10+ (tested on 3.10, 3.11, 3.12)
**Core Dependencies:**
- pdfminer.six - PDF text extraction
- pdf2image - PDF to image conversion
- pytesseract - OCR engine
- opencv-python - Image preprocessing
- Pillow - Image handling
- python-docx - DOCX extraction
- python-pptx - PPTX extraction
- langchain-text-splitters - Semantic chunking
- tiktoken - Token counting
**Optional:**
- meilisearch - Search engine integration
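Because chunk sizes are measured in tokens, tiktoken does the counting. A minimal illustration — the encoding name here is an assumption; docprocessor may use a different one internally:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption
print(len(enc.encode("How many tokens is this sentence?")))
```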
## Examples
See the `examples/` directory for more usage examples:
- `basic_usage.py` - Simple document processing
- `multi_environment.py` - Indexing to multiple environments
- `custom_chunking.py` - Advanced chunking options
## Development
### Using GitHub Codespaces (Recommended)
The easiest way to start developing:
1. Click the **Code** button on GitHub
2. Select **Codespaces** → **Create codespace on main**
3. Wait for the environment to build (includes all dependencies)
4. Start coding!
The devcontainer automatically installs:
- Python 3.11
- All system dependencies (Tesseract, Poppler)
- Python dependencies in editable mode
- Pre-commit hooks
- VS Code extensions (Black, isort, flake8, etc.)
### Local Development Setup
```bash
# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
# Install Python dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run tests with coverage
pytest --cov=docprocessor
```
### Code Quality
We use automated tools to maintain code quality:
```bash
# Format code
black docprocessor tests
# Sort imports
isort docprocessor tests
# Lint
flake8 docprocessor tests
# Type check
mypy docprocessor
# Or run all checks with pre-commit
pre-commit run --all-files
```
### Running Tests
```bash
# Run all tests
pytest
# With coverage report
pytest --cov=docprocessor --cov-report=html
# Run specific test file
pytest tests/test_processor.py -v
# Run tests matching pattern
pytest -k "test_extract" -v
```
## Contributing
We love contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
- Issue reporting
Quick tips:
- Use the devcontainer for a consistent environment
- Write tests for new features
- Follow PEP 8 and use pre-commit hooks
- Update documentation for API changes
- Add entries to CHANGELOG.md
## Changelog
See [CHANGELOG.md](CHANGELOG.md) for version history and release notes.
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Support
- **Issues**: [GitHub Issues](https://github.com/Knowledge-Innovation-Centre/doc-processor/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Knowledge-Innovation-Centre/doc-processor/discussions)
- **Email**: info@knowledgeinnovation.eu
## Citation
If you use docprocessor in your research or project, please cite:
```bibtex
@software{docprocessor2025,
  title  = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year   = {2025},
  url    = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}
```
---
Made with ❤️ by [Knowledge Innovation Centre](https://knowledgeinnovation.eu)