Name | docler |
Version | 0.5.0 |
home_page | None |
Summary | Abstractions & Tools for OCR / document processing |
upload_time | 2025-10-06 20:36:08 |
maintainer | None |
docs_url | None |
author | Philipp Temminghoff |
requires_python | >=3.12 |
license | MIT License
Copyright (c) 2024, Philipp Temminghoff
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
|
# Docler
[Read the documentation!](https://phil65.github.io/docler/)
A unified Python library for document conversion and OCR that provides a consistent interface to multiple document processing providers. Extract text, images, and metadata from PDFs, images, and office documents using state-of-the-art OCR and document AI services.
## Features
- **Unified Interface**: Single API for multiple document processing providers
- **Multiple Providers**: Support for 10+ OCR and document AI services
- **Rich Output**: Extract text, images, tables, and metadata
- **Async Support**: Built-in async/await support
- **Flexible Configuration**: Provider-specific settings and preferences
- **Page Range Support**: Process specific pages from documents
- **Multi-language OCR**: Support for 100+ languages across providers
- **Structured Output**: Standardized markdown with embedded metadata
## Quick Start
```python
import asyncio

from docler import MistralConverter


async def main():
    # Create a converter (Mistral OCR here; other providers follow the same pattern)
    converter = MistralConverter()

    # Convert a document
    result = await converter.convert_file("document.pdf")

    print(f"Title: {result.title}")
    print(f"Content: {result.content[:500]}...")
    print(f"Images: {len(result.images)} extracted")
    print(f"Pages: {result.page_count}")


asyncio.run(main())
```
## Available OCR Converters
### Cloud API Providers
#### Azure Document Intelligence
```python
from docler import AzureConverter

converter = AzureConverter(
    endpoint="your-endpoint",
    api_key="your-key",
    model="prebuilt-layout",
)
```
#### Mistral OCR
```python
from docler import MistralConverter

converter = MistralConverter(
    api_key="your-key",
    languages=["en", "fr", "de"],
)
```
#### LlamaParse
```python
from docler import LlamaParseConverter

converter = LlamaParseConverter(
    api_key="your-key",
    adaptive_long_table=True,
)
```
#### Upstage Document AI
```python
from docler import UpstageConverter

converter = UpstageConverter(
    api_key="your-key",
    chart_recognition=True,
)
```
#### DataLab
```python
from docler import DataLabConverter

converter = DataLabConverter(
    api_key="your-key",
    use_llm=False,  # Set to True for higher accuracy
)
```
### Local/Self-Hosted Providers
#### Marker
```python
from docler import MarkerConverter

converter = MarkerConverter(
    dpi=192,
    use_llm=True,  # Requires local LLM setup
    llm_provider="ollama",
)
```
#### Docling
```python
from docler import DoclingConverter

converter = DoclingConverter(
    ocr_engine="easy_ocr",
    image_scale=2.0,
)
```
#### Docling Remote
```python
from docler import DoclingRemoteConverter

converter = DoclingRemoteConverter(
    endpoint="http://localhost:5001",
    pdf_backend="dlparse_v4",
)
```
#### MarkItDown (Microsoft)
```python
from docler import MarkItDownConverter
converter = MarkItDownConverter()
```
### LLM-Based Providers
#### LLM Converter
```python
from docler import LLMConverter

converter = LLMConverter(
    model="gpt-4o",  # or claude-3-5-sonnet, etc.
    system_prompt="Extract text preserving formatting...",
)
```
## Provider Comparison
| Provider | Cost/Page | Local | API Required | Best For |
|----------|-----------|-------|--------------|----------|
| **Azure** | $0.0096 | ❌ | ✅ | Enterprise forms, invoices |
| **Mistral** | Variable | ❌ | ✅ | High-quality text extraction |
| **LlamaParse** | $0.0045 | ❌ | ✅ | Complex layouts, academic papers |
| **Upstage** | $0.01 | ❌ | ✅ | Charts, presentations |
| **DataLab** | $0.0015 | ❌ | ✅ | Cost-effective processing |
| **Marker** | Free | ✅ | ❌ | Privacy-sensitive documents |
| **Docling** | Free | ✅ | ❌ | Open-source processing |
| **MarkItDown** | Free | ✅ | ❌ | Office documents |
| **LLM** | Variable | ❌ | ✅ | Latest AI capabilities |
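To make the per-page prices concrete, here is a minimal sketch that estimates the cost of a conversion job. The prices are copied from the table above and may not reflect current provider pricing; `COST_PER_PAGE` and `estimate_cost` are illustrative names and not part of docler. Providers listed as "Variable" are left out because the table gives no figure for them.

```python
# Per-page prices in USD, copied from the comparison table.
# Local providers are free per page; "Variable" providers are omitted.
COST_PER_PAGE = {
    "Azure": 0.0096,
    "LlamaParse": 0.0045,
    "Upstage": 0.01,
    "DataLab": 0.0015,
    "Marker": 0.0,
    "Docling": 0.0,
    "MarkItDown": 0.0,
}


def estimate_cost(provider: str, pages: int) -> float:
    """Return the estimated cost in USD for converting `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return COST_PER_PAGE[provider] * pages


if __name__ == "__main__":
    for name in ("Azure", "LlamaParse", "DataLab"):
        print(f"{name}: ${estimate_cost(name, 1000):.2f} per 1000 pages")
```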
## Advanced Usage
### Directory Processing
Process entire directories with progress tracking:
```python
import asyncio

from docler import DirectoryConverter, MarkerConverter


async def main():
    base_converter = MarkerConverter()
    dir_converter = DirectoryConverter(base_converter, chunk_size=10)

    # Convert all supported files
    results = await dir_converter.convert("./documents/")

    # Or with progress tracking
    async for state in dir_converter.convert_with_progress("./documents/"):
        print(f"Progress: {state.processed_files}/{state.total_files}")
        print(f"Current: {state.current_file}")
        if state.errors:
            print(f"Errors: {len(state.errors)}")

asyncio.run(main())
```
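The README does not say how `chunk_size` is used internally. One plausible reading, and it is only an assumption, is that the directory's files are processed in groups of that size. The sketch below shows that grouping in isolation; `chunked` is a hypothetical helper, not a docler function.

```python
from collections.abc import Iterator, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# 25 hypothetical file names split into groups of 10
files = [f"doc{i}.pdf" for i in range(1, 26)]
batches = list(chunked(files, 10))

if __name__ == "__main__":
    for number, batch in enumerate(batches, start=1):
        print(f"Batch {number}: {len(batch)} files")
```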
### Page Range Processing
Extract specific pages from documents:
```python
from docler import MistralConverter

# Extract pages 1-5 and 10-15 (run inside an async function)
converter = MistralConverter(page_range="1-5,10-15")
result = await converter.convert_file("large_document.pdf")
```
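To illustrate what a range string such as `"1-5,10-15"` denotes, the following sketch expands it into explicit page numbers. It assumes 1-based, inclusive ranges; `expand_page_range` is a hypothetical helper written for this example and is not docler's own parser, which may behave differently.

```python
def expand_page_range(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10-15" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive upper bound
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(expand_page_range("1-5,10-15"))
```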
### Batch Processing
Process multiple files efficiently:
```python
# Assumes an existing `converter`; run inside an async function
files = ["doc1.pdf", "doc2.png", "doc3.docx"]
results = await converter.convert_files(files)

for file, result in zip(files, results):
    print(f"{file}: {len(result.content)} characters extracted")
```
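When each file should succeed or fail on its own, a common asyncio pattern is to bound concurrency with a semaphore and collect exceptions as values. The sketch below shows that pattern with a stand-in coroutine; `convert_all` and `fake_convert` are hypothetical names, and this is not a description of how `convert_files` works internally.

```python
import asyncio


async def convert_all(convert_one, paths, max_concurrency=4):
    """Run `convert_one(path)` for every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        async with semaphore:
            return await convert_one(path)

    # return_exceptions=True returns failures as values instead of raising the first one
    return await asyncio.gather(*(guarded(p) for p in paths), return_exceptions=True)


async def fake_convert(path):
    """Stand-in for a real conversion call; rejects unknown file suffixes."""
    await asyncio.sleep(0)
    if not path.endswith((".pdf", ".png", ".docx")):
        raise ValueError(f"unsupported file: {path}")
    return f"converted {path}"


results = asyncio.run(convert_all(fake_convert, ["doc1.pdf", "notes.xyz"]))

if __name__ == "__main__":
    for item in results:
        print("failed:" if isinstance(item, Exception) else "ok:", item)
```

In real use, `fake_convert` would be replaced by a bound method such as `converter.convert_file`.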
## Output Format
All converters return a standardized `Document` object with:
```python
class Document:
    content: str         # Extracted text in markdown format
    images: list[Image]  # Extracted images with metadata
    title: str           # Document title
    source_path: str     # Original file path
    mime_type: str       # File MIME type
    metadata: dict       # Provider-specific metadata
    page_count: int      # Number of pages processed
```
The markdown content includes standardized metadata for page breaks and structure:
```markdown
<!-- docler:page_break {"next_page":1} -->
# Document Title
Content from page 1...
<!-- docler:page_break {"next_page":2} -->
More content from page 2...
```
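Because the page-break markers are plain HTML comments with a JSON payload, the content can be split back into pages with the standard library alone. This is a minimal sketch based only on the marker format shown above; `split_pages` is a hypothetical helper and not part of docler, and it assumes the JSON payload is a flat object.

```python
import json
import re

# Matches <!-- docler:page_break {...} --> and captures the JSON payload.
# The non-greedy match stops at the first "}", so nested objects are not supported.
PAGE_BREAK = re.compile(r"<!--\s*docler:page_break\s*(\{.*?\})\s*-->")


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the markdown that follows its page-break marker."""
    pages: dict[int, str] = {}
    matches = list(PAGE_BREAK.finditer(markdown))
    for index, match in enumerate(matches):
        page_number = json.loads(match.group(1))["next_page"]
        # A page ends where the next marker starts, or at the end of the text
        end = matches[index + 1].start() if index + 1 < len(matches) else len(markdown)
        pages[page_number] = markdown[match.end():end].strip()
    return pages


sample = (
    '<!-- docler:page_break {"next_page":1} -->\n'
    "# Document Title\n"
    "Content from page 1...\n"
    '<!-- docler:page_break {"next_page":2} -->\n'
    "More content from page 2...\n"
)
pages = split_pages(sample)

if __name__ == "__main__":
    for number, text in pages.items():
        print(f"--- page {number} ---\n{text}")
```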
## Installation
```bash
# Basic installation
pip install docler

# With specific provider dependencies (quoted so shells like zsh do not expand the brackets)
pip install "docler[azure]"    # Azure Document Intelligence
pip install "docler[mistral]"  # Mistral OCR
pip install "docler[marker]"   # Marker PDF processing
pip install "docler[all]"      # All providers
```
## Environment Variables
Configure API keys via environment variables:
```bash
export AZURE_DOC_INTELLIGENCE_ENDPOINT="your-endpoint"
export AZURE_DOC_INTELLIGENCE_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export LLAMAPARSE_API_KEY="your-key"
export UPSTAGE_API_KEY="your-key"
export DATALAB_API_KEY="your-key"
```
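Before running a conversion, it can help to check that the variables for the chosen provider are set. The sketch below uses the variable names listed above; the lowercase provider keys and the `missing_variables` helper are assumptions made for this example and are not part of docler.

```python
import os

# Environment variable names from the list above, grouped by provider.
# The lowercase provider keys are labels chosen for this example.
PROVIDER_ENV_VARS = {
    "azure": ["AZURE_DOC_INTELLIGENCE_ENDPOINT", "AZURE_DOC_INTELLIGENCE_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "llamaparse": ["LLAMAPARSE_API_KEY"],
    "upstage": ["UPSTAGE_API_KEY"],
    "datalab": ["DATALAB_API_KEY"],
}


def missing_variables(provider, environ=None):
    """Return the variables for `provider` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in PROVIDER_ENV_VARS[provider] if not env.get(name)]


if __name__ == "__main__":
    for provider in PROVIDER_ENV_VARS:
        missing = missing_variables(provider)
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```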
## Contributing
We welcome contributions! See our [contributing guidelines](CONTRIBUTING.md) for details.
## License
MIT License - see [LICENSE](LICENSE) for details.
## Links
- **Documentation**: https://phil65.github.io/docler/
- **PyPI**: https://pypi.org/project/docler/
- **GitHub**: https://github.com/phil65/docler/
- **Issues**: https://github.com/phil65/docler/issues
- **Discussions**: https://github.com/phil65/docler/discussions
---
**Coming Soon**: FastAPI demo with bring-your-own-keys on https://contexter.net
Raw data
{
"_id": null,
"home_page": null,
"name": "docler",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.12",
"maintainer_email": null,
"keywords": null,
"author": "Philipp Temminghoff",
"author_email": "Philipp Temminghoff <philipptemminghoff@googlemail.com>",
"download_url": "https://files.pythonhosted.org/packages/ea/18/94dff38279dab9c50e04b91193aafa21e593db7fdafcda5b846b98f5ddbd/docler-0.5.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT License\n \n Copyright (c) 2024, Philipp Temminghoff\n \n Permission is hereby granted, free of charge, to any person obtaining a copy\n of this software and associated documentation files (the \"Software\"), to deal\n in the Software without restriction, including without limitation the rights\n to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n copies of the Software, and to permit persons to whom the Software is\n furnished to do so, subject to the following conditions:\n \n The above copyright notice and this permission notice shall be included in all\n copies or substantial portions of the Software.\n \n THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n SOFTWARE.\n ",
"summary": "Abstractions & Tools for OCR / document processing",
"version": "0.5.0",
"project_urls": {
"Code coverage": "https://app.codecov.io/gh/phil65/docler",
"Discussions": "https://github.com/phil65/docler/discussions",
"Documentation": "https://phil65.github.io/docler/",
"Issues": "https://github.com/phil65/docler/issues",
"Source": "https://github.com/phil65/docler"
},
"split_keywords": [],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "148694b9b63ba78e0e716a008f987befff04fead87c844ef5b7750498b4175ac",
"md5": "614d46e01462fe306e1e16cadeef4b1f",
"sha256": "eef62b600101332938adf7ca3f70323ea5fe0ac25beec18c21c098872b2fec2d"
},
"downloads": -1,
"filename": "docler-0.5.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "614d46e01462fe306e1e16cadeef4b1f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.12",
"size": 1480521,
"upload_time": "2025-10-06T20:36:07",
"upload_time_iso_8601": "2025-10-06T20:36:07.040405Z",
"url": "https://files.pythonhosted.org/packages/14/86/94b9b63ba78e0e716a008f987befff04fead87c844ef5b7750498b4175ac/docler-0.5.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ea1894dff38279dab9c50e04b91193aafa21e593db7fdafcda5b846b98f5ddbd",
"md5": "0fbc81bcba33ce5dabf926306e9df455",
"sha256": "639d77a3b8915656414ea6ff561ef2e815683a6fc33955a113159c11d4f9f03a"
},
"downloads": -1,
"filename": "docler-0.5.0.tar.gz",
"has_sig": false,
"md5_digest": "0fbc81bcba33ce5dabf926306e9df455",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.12",
"size": 1425750,
"upload_time": "2025-10-06T20:36:08",
"upload_time_iso_8601": "2025-10-06T20:36:08.862203Z",
"url": "https://files.pythonhosted.org/packages/ea/18/94dff38279dab9c50e04b91193aafa21e593db7fdafcda5b846b98f5ddbd/docler-0.5.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-06 20:36:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "phil65",
"github_project": "docler",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "docler"
}