# Kreuzberg
**High-performance Python library for text extraction from documents.** Extract text from PDFs, images, office documents, and more with both async and sync APIs.
📖 **[Complete Documentation](https://goldziher.github.io/kreuzberg/)**
## Why Kreuzberg?
- **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library
- **💾 Memory Efficient**: 14x smaller than alternatives (71MB vs 1GB+) with lowest memory usage (~530MB)
- **⚡ Dual APIs**: Only library with both sync and async support
- **🔧 Zero Configuration**: Works out of the box with sane defaults
- **🏠 Local Processing**: No cloud dependencies or external API calls
- **📦 Rich Format Support**: PDFs, images, Office docs, HTML, and more
- **🔍 Multiple OCR Engines**: Tesseract, EasyOCR, and PaddleOCR support
- **🐳 Production Ready**: CLI, REST API, and Docker images included
## Quick Start
### Installation
```bash
# Basic installation
pip install kreuzberg
# With optional features
pip install "kreuzberg[cli,api]" # CLI + REST API
pip install "kreuzberg[easyocr,gmft]" # EasyOCR + table extraction
pip install "kreuzberg[all]" # Everything
```
### System Dependencies
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc
# macOS
brew install tesseract pandoc
# Windows
choco install tesseract pandoc
```
### Basic Usage
```python
import asyncio
from kreuzberg import extract_file
async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)


asyncio.run(main())
```
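The example above uses the async API. For scripts that do not run an event loop, a blocking call may be more convenient. The sketch below assumes the sync entry point is exported as `extract_file_sync` (check the API reference for your installed version). The `extract_text` helper and its `extractor` parameter are hypothetical, added here only so the function can be exercised with a stub.

```python
from pathlib import Path


def extract_text(path, extractor=None):
    """Return the extracted text of `path` using a blocking call."""
    if extractor is None:
        # Assumption: kreuzberg exports a synchronous `extract_file_sync`.
        from kreuzberg import extract_file_sync as extractor
    result = extractor(Path(path))
    # `content` is the same attribute printed in the async example above.
    return result.content
```

Calling `extract_text("document.pdf")` should then return the same text that `result.content` holds in the async example.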
## Deployment Options
### 🐳 Docker (Recommended)
```bash
# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:latest
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
```
Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`
### 🌐 REST API
```bash
# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run
# Health check
curl http://localhost:8000/health
# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
```
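The `curl` call above can also be issued from Python. The following is a minimal sketch using only the standard library. It mirrors the `-F "data=@file.pdf"` upload by building a `multipart/form-data` body with a single `data` field. The helper names `build_extract_request` and `post_file` are hypothetical, and because the response schema is not described here, the body is returned as raw text.

```python
import urllib.request
import uuid
from pathlib import Path


def build_extract_request(filename, payload, base_url="http://localhost:8000"):
    """Build a POST request equivalent to: curl -F "data=@<filename>" <base_url>/extract"""
    boundary = uuid.uuid4().hex
    # Note: `filename` is inserted as-is, so it must not contain double quotes.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="data"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def post_file(path, base_url="http://localhost:8000"):
    """Upload one file to a running server and return the raw response text."""
    path = Path(path)
    request = build_extract_request(path.name, path.read_bytes(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server from the commands above running, `print(post_file("file.pdf"))` would show whatever the `/extract` endpoint returns.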
### 💻 Command Line
```bash
# Install CLI
pip install "kreuzberg[cli]"
# Extract to stdout
kreuzberg extract document.pdf
# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata
# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/
```
## Supported Formats
| Category | Formats |
| ----------------- | ------------------------------ |
| **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
| **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
| **Spreadsheets** | XLSX, XLS, CSV, ODS |
| **Presentations** | PPTX, PPT, ODP |
| **Web** | HTML, XML, MHTML |
| **Archives** | Support via extraction |
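As a quick pre-filter before handing files to the library, the table above can be turned into a lookup. This is only an illustrative sketch: the extension sets are copied literally from the table, the name `format_category` is hypothetical, and the library's own format detection may differ (for example, it may rely on MIME types or accept extensions not listed here).

```python
from pathlib import Path

# Extensions copied from the table above; archives are omitted because
# the table lists no concrete extensions for them.
SUPPORTED_EXTENSIONS = {
    "Documents": {"pdf", "docx", "doc", "rtf", "txt", "epub"},
    "Images": {"jpg", "png", "tiff", "bmp", "gif", "webp"},
    "Spreadsheets": {"xlsx", "xls", "csv", "ods"},
    "Presentations": {"pptx", "ppt", "odp"},
    "Web": {"html", "xml", "mhtml"},
}


def format_category(path):
    """Return the table category for a file name, or None if it is not listed."""
    suffix = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```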
## Performance
**[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):
| Library | Speed | Memory | Install Size | Dependencies | Success Rate |
| ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
| **Kreuzberg** | **35+ files/s** | **530MB** | **71MB** | **20** | High\* |
| Unstructured | Moderate | ~1GB | 146MB | 54 | 88%+ |
| MarkItDown | Good† | ~1.5GB | 251MB | 25 | 80%† |
| Docling | 60+ min/file‡ | ~5GB | 1,032MB | 88 | Low‡ |
\*_Can achieve 75% reliability with 15% performance trade-off when configured_
†_Good on simple documents, struggles with large/complex files (>10MB)_
‡_Frequently fails/times out on medium files (>1MB)_
> **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
> **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
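To illustrate the rule of thumb, here is one way batch extraction with the async API might be structured: run many extractions concurrently while capping how many are in flight. This is a generic `asyncio` sketch, not the library's own batch implementation. The helper name `extract_many` and the `max_concurrency` parameter are hypothetical, and `extract` stands for any coroutine function that takes a path, such as `extract_file` from the Basic Usage example.

```python
import asyncio


async def extract_many(paths, extract, max_concurrency=4):
    """Run `extract(path)` for every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def extract_one(path):
        # The semaphore caps the number of extractions running at once.
        async with semaphore:
            return await extract(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(extract_one(path) for path in paths))
```

A call such as `asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))` would return one result per input, in input order. Tune `max_concurrency` to the memory available, since OCR-heavy documents can be expensive.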
## Documentation
### Quick Links
- [Installation Guide](https://goldziher.github.io/kreuzberg/getting-started/installation/) - Setup and dependencies
- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - Comprehensive usage guide
- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Complete API documentation
- [Docker Guide](https://goldziher.github.io/kreuzberg/user-guide/docker/) - Container deployment
- [REST API](https://goldziher.github.io/kreuzberg/user-guide/api-server/) - HTTP endpoints
- [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line usage
- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - OCR engine setup
## Advanced Features
- **📊 Table Extraction**: Extract tables from PDFs with GMFT
- **🧩 Content Chunking**: Split documents for RAG applications
- **🎯 Custom Extractors**: Extend with your own document handlers
- **🔧 Configuration**: Flexible TOML-based configuration
- **🪝 Hooks**: Pre/post-processing customization
- **🌍 Multi-language OCR**: 100+ languages supported
- **⚙️ Metadata Extraction**: Rich document metadata
- **🔄 Batch Processing**: Efficient bulk document processing
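To make the content chunking item concrete, the sketch below shows the general idea of splitting extracted text into overlapping windows for a RAG pipeline. It is a plain-Python illustration, not the library's built-in chunking (see the User Guide for that). The name `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split `text` into chunks of at most `max_chars`, each sharing
    `overlap` characters with the previous chunk."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")

    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current chunk reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

For example, `chunk_text(result.content, max_chars=500, overlap=50)` would produce windows suitable for embedding. The overlap keeps sentences that straddle a boundary visible in both neighbouring chunks.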
## License
MIT License - see [LICENSE](LICENSE) for details.
______________________________________________________________________
<div align="center">
**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**
Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)
</div>
Raw data
{
"_id": null,
"home_page": null,
"name": "kreuzberg",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "document-processing, entity-extraction, image-to-text, keyword-extraction, named-entity-recognition, ner, ocr, pandoc, pdf-extraction, rag, spacy, table-extraction, tesseract, text-extraction, text-processing",
"author": null,
"author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/0c/a2/5b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee/kreuzberg-3.6.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A text extraction library supporting PDFs, images, office documents and more",
"version": "3.6.2",
"project_urls": {
"homepage": "https://github.com/Goldziher/kreuzberg"
},
"split_keywords": [
"document-processing",
" entity-extraction",
" image-to-text",
" keyword-extraction",
" named-entity-recognition",
" ner",
" ocr",
" pandoc",
" pdf-extraction",
" rag",
" spacy",
" table-extraction",
" tesseract",
" text-extraction",
" text-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "689d80702fabb1fd451e183cb52c48b5535b3eeb19b7a6b4f66e3d8e61e8a227",
"md5": "02881e050138370139251ea8e01da13b",
"sha256": "36bba10f46a0347993c50c08c0f6c731bf528cf507e669781a7cb3eecb1daec0"
},
"downloads": -1,
"filename": "kreuzberg-3.6.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "02881e050138370139251ea8e01da13b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 95190,
"upload_time": "2025-07-11T07:21:56",
"upload_time_iso_8601": "2025-07-11T07:21:56.259517Z",
"url": "https://files.pythonhosted.org/packages/68/9d/80702fabb1fd451e183cb52c48b5535b3eeb19b7a6b4f66e3d8e61e8a227/kreuzberg-3.6.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "0ca25b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee",
"md5": "0223cb856900b324f298edc67d8241d4",
"sha256": "4865e4a0be85e72867791dc19bbb8a7c8c4853900b59cb4cbef5bfaecafa1626"
},
"downloads": -1,
"filename": "kreuzberg-3.6.2.tar.gz",
"has_sig": false,
"md5_digest": "0223cb856900b324f298edc67d8241d4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 9389554,
"upload_time": "2025-07-11T07:21:57",
"upload_time_iso_8601": "2025-07-11T07:21:57.560903Z",
"url": "https://files.pythonhosted.org/packages/0c/a2/5b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee/kreuzberg-3.6.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-11 07:21:57",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Goldziher",
"github_project": "kreuzberg",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kreuzberg"
}