# Kreuzberg
**A document intelligence framework for Python.** Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.
📖 **[Complete Documentation](https://kreuzberg.dev/)**
## Framework Overview
### Document Intelligence Capabilities
- **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
- **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
- **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- **OCR Integration**: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
- **Table Detection**: Structured table extraction with cell-level precision via GMFT integration
### Technical Architecture
- **Performance**: Highest throughput among the Python text extraction libraries benchmarked (about 32 files/second on tiny files with the sync API; see Performance Characteristics)
- **Resource Efficiency**: 71MB installation, ~360MB runtime memory footprint
- **Extensibility**: Plugin architecture for custom extractors via the Extractor base class
- **API Design**: Synchronous and asynchronous APIs with consistent interfaces
- **Type Safety**: Complete type annotations throughout the codebase
### Open Source Foundation
Kreuzberg leverages established open source technologies:
- **Pandoc**: Universal document converter for robust format support
- **PDFium**: Google's PDF rendering engine for accurate PDF processing
- **Tesseract**: Open source OCR engine for text recognition (originally developed at HP, later sponsored by Google)
- **python-docx / python-pptx**: Native Microsoft Office format support
## Quick Start
### Extract Text with CLI
```bash
# Extract text from any file to markdown
uvx kreuzberg extract document.pdf > output.md
# With all features (OCR, table extraction, etc.)
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown
# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --format json
```
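The same CLI calls can be driven from a Python script. The sketch below only assembles the argument list, using the subcommand and flags shown above; the helper name `build_extract_command` is hypothetical and not part of Kreuzberg.

```python
import shlex
from typing import List, Optional


def build_extract_command(
    path: str,
    show_metadata: bool = False,
    output_format: Optional[str] = None,
) -> List[str]:
    """Build argv for `uvx kreuzberg extract` using only the flags shown above."""
    cmd = ["uvx", "kreuzberg", "extract", path]
    if show_metadata:
        cmd.append("--show-metadata")
    if output_format is not None:
        cmd += ["--format", output_format]
    return cmd


cmd = build_extract_command("report.pdf", show_metadata=True, output_format="json")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, capture_output=True, text=True, check=True)` would execute it, provided `uvx` is on the `PATH`.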
### Python Usage
**Async (recommended for web apps):**
```python
import asyncio
from kreuzberg import extract_file

async def main() -> None:
    result = await extract_file("presentation.pptx")
    print(result.content)
    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")

# Scripts can drive the coroutine with asyncio.run; web apps await it directly
asyncio.run(main())
```
**Sync (for scripts and CLI tools):**
```python
from kreuzberg import extract_file_sync
result = extract_file_sync("report.docx")
print(result.content)
# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
```
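When many files need processing, the async API can be fanned out with a concurrency cap. The following is a minimal sketch, not Kreuzberg's own batching mechanism: `extract_many` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `extract_file` so the example runs without Kreuzberg installed.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[T]],
    limit: int = 4,
) -> List[T]:
    """Run `extract` over all paths, at most `limit` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> T:
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(worker(p) for p in paths))


async def fake_extract(path: str) -> str:
    """Stand-in for kreuzberg.extract_file (assumption: any async callable works here)."""
    await asyncio.sleep(0)
    return f"text of {path}"


results = asyncio.run(extract_many(["a.pdf", "b.docx", "c.png"], fake_extract, limit=2))
print(results)
```

With Kreuzberg installed, passing `extract_file` in place of `fake_extract` should give a list of extraction results in the same order as the input paths.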
### Docker
```bash
# Run the REST API
docker run -p 8000:8000 goldziher/kreuzberg
# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
```
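The `curl` upload above can be approximated from Python with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, the form field name `file` and the URL are taken from the `curl` example, and the response is returned as raw bytes because its shape is not described here.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(
    field: str, filename: str, data: bytes, boundary: str, mime_type: str
) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(url: str, path: str, timeout: float = 60.0) -> bytes:
    """POST a file to the extraction endpoint and return the raw response body."""
    file_path = Path(path)
    mime_type = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"
    body, content_type = build_multipart(
        "file", file_path.name, file_path.read_bytes(), uuid.uuid4().hex, mime_type
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Requires the container from the Docker snippet to be running:
# print(post_file("http://localhost:8000/extract", "document.pdf").decode("utf-8"))
```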
📖 **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** • **[CLI Documentation](https://kreuzberg.dev/cli/)** • **[API Reference](https://kreuzberg.dev/api-reference/)**
## Deployment Options
### 🤖 MCP Server (AI Integration)
**Add to Claude Code (the `claude` CLI) with one command:**
```bash
claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp
```
**Or configure manually in `claude_desktop_config.json`:**
```json
{
"mcpServers": {
"kreuzberg": {
"command": "uvx",
"args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
}
}
}
```
**MCP capabilities:**
- Extract text from PDFs, images, Office docs, and more
- Full OCR support with multiple engines
- Table extraction and metadata parsing
📖 **[MCP Documentation](https://kreuzberg.dev/user-guide/mcp-server/)**
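If `claude_desktop_config.json` already lists other servers, the Kreuzberg entry has to be merged in rather than pasted over the file. A minimal sketch of that merge, assuming the JSON structure shown above; `add_kreuzberg_server` and the `other-tool` entry are hypothetical and exist only for illustration.

```python
import json
from typing import Any, Dict

# Server entry copied from the manual configuration above
KREUZBERG_SERVER = {"command": "uvx", "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]}


def add_kreuzberg_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with the kreuzberg MCP entry added, other servers kept."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["kreuzberg"] = dict(KREUZBERG_SERVER)
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server
existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
merged = add_kreuzberg_server(existing)
print(json.dumps(merged, indent=2))
```

The function copies the top-level mapping and the `mcpServers` mapping, so the input dictionary is left unchanged and can be compared against the result before writing the file back.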
## Supported Formats
| Category | Formats |
| ----------------- | ------------------------------ |
| **Documents** | PDF, DOCX, DOC, RTF, TXT, EPUB |
| **Images** | JPG, PNG, TIFF, BMP, GIF, WEBP |
| **Spreadsheets** | XLSX, XLS, CSV, ODS |
| **Presentations** | PPTX, PPT, ODP |
| **Web** | HTML, XML, MHTML |
| **Archives** | Support via extraction |
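The table above can serve as a pre-filter before files are handed to the extractor. The sketch below transcribes the listed extensions into a lookup; `category_for` is a hypothetical helper, not part of Kreuzberg, and the Archives row is left out because the table names no extensions for it.

```python
from pathlib import Path
from typing import Optional

# Extensions transcribed from the Supported Formats table
FORMATS = {
    "Documents": ["pdf", "docx", "doc", "rtf", "txt", "epub"],
    "Images": ["jpg", "png", "tiff", "bmp", "gif", "webp"],
    "Spreadsheets": ["xlsx", "xls", "csv", "ods"],
    "Presentations": ["pptx", "ppt", "odp"],
    "Web": ["html", "xml", "mhtml"],
}

EXTENSION_TO_CATEGORY = {ext: category for category, exts in FORMATS.items() for ext in exts}


def category_for(path: str) -> Optional[str]:
    """Return the table category for a file path, or None if the extension is not listed."""
    extension = Path(path).suffix.lower().lstrip(".")
    return EXTENSION_TO_CATEGORY.get(extension)


for name in ["Report.PDF", "scan.webp", "notes.xyz"]:
    print(name, "->", category_for(name))
```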
## 📊 Performance Characteristics
[View comprehensive benchmarks](https://benchmarks.kreuzberg.dev/) • [Benchmark methodology](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://kreuzberg.dev/performance-analysis/)
### Technical Specifications
| Metric | Kreuzberg Sync | Kreuzberg Async | Benchmark result |
| ---------------------------- | -------------- | --------------- | ------------------ |
| **Throughput (tiny files)** | 31.78 files/s | 23.94 files/s | Highest throughput |
| **Throughput (small files)** | 8.91 files/s | 9.31 files/s | Highest throughput |
| **Memory footprint** | 359.8 MB | 395.2 MB | Lowest usage |
| **Installation size** | 71 MB | 71 MB | Smallest size |
| **Success rate** | 100% | 100% | Perfect |
| **Supported formats** | 18 | 18 | Comprehensive |
### Architecture Advantages
- **Native C extensions**: Built on PDFium and Tesseract for maximum performance
- **Async/await support**: True asynchronous processing with intelligent task scheduling
- **Memory efficiency**: Streaming architecture minimizes memory allocation
- **Process pooling**: Automatic multiprocessing for CPU-intensive operations
- **Optimized data flow**: Efficient data handling with minimal transformations
> **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.
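To check these throughput figures against your own documents, a small timing harness is enough. This is a rough sketch rather than the methodology of the linked benchmarks: `files_per_second` and `ms_per_file` are hypothetical helpers, and the `time.sleep` call stands in for a real extraction function such as `extract_file_sync`.

```python
import time
from typing import Callable, Iterable


def files_per_second(process: Callable[[str], object], paths: Iterable[str]) -> float:
    """Time `process` over all paths sequentially and return files per second."""
    items = list(paths)
    start = time.perf_counter()
    for path in items:
        process(path)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed if elapsed > 0 else float("inf")


def ms_per_file(rate: float) -> float:
    """Convert a files-per-second rate into average milliseconds per file."""
    return 1000.0 / rate


# Stand-in workload so the sketch runs without Kreuzberg installed
rate = files_per_second(lambda path: time.sleep(0.001), ["a.pdf", "b.pdf", "c.pdf"])
print(f"{rate:.1f} files/s, {ms_per_file(rate):.2f} ms per file")
```

Read this way, the 31.78 files/s reported for tiny files corresponds to roughly 31.5 ms per file, and 8.91 files/s for small files to roughly 112 ms per file.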
## Documentation
### Quick Links
- [Installation Guide](https://kreuzberg.dev/getting-started/installation/) - Setup and dependencies
- [User Guide](https://kreuzberg.dev/user-guide/) - Comprehensive usage guide
- [Performance Analysis](https://kreuzberg.dev/performance-analysis/) - Detailed benchmark results
- [API Reference](https://kreuzberg.dev/api-reference/) - Complete API documentation
- [Docker Guide](https://kreuzberg.dev/user-guide/docker/) - Container deployment
- [REST API](https://kreuzberg.dev/user-guide/api-server/) - HTTP endpoints
- [CLI Guide](https://kreuzberg.dev/cli/) - Command-line usage
- [OCR Configuration](https://kreuzberg.dev/user-guide/ocr-configuration/) - OCR engine setup
## License
MIT License - see [LICENSE](LICENSE) for details.
## Raw data
{
"_id": null,
"home_page": null,
"name": "kreuzberg",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "async, document-analysis, document-intelligence, document-processing, extensible, information-extraction, mcp, metadata-extraction, model-context-protocol, ocr, pandoc, pdf-extraction, pdfium, plugin-architecture, rag, retrieval-augmented-generation, structured-data, table-extraction, tesseract, text-extraction",
"author": null,
"author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/1a/92/2ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098/kreuzberg-3.8.2.tar.gz",
"platform": null,
"description": "(same README text as above)",
"bugtrack_url": null,
"license": "MIT",
"summary": "Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats",
"version": "3.8.2",
"project_urls": {
"documentation": "https://kreuzberg.dev",
"homepage": "https://github.com/Goldziher/kreuzberg"
},
"split_keywords": [
"async",
" document-analysis",
" document-intelligence",
" document-processing",
" extensible",
" information-extraction",
" mcp",
" metadata-extraction",
" model-context-protocol",
" ocr",
" pandoc",
" pdf-extraction",
" pdfium",
" plugin-architecture",
" rag",
" retrieval-augmented-generation",
" structured-data",
" table-extraction",
" tesseract",
" text-extraction"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "a2fab21c5d9b78a0f2f9bdba6ae966904a408e34fc6c56d684c8d93e560e9870",
"md5": "0d339d9b7c2449e85c21e0f4f1e53379",
"sha256": "370b436713c3c03b37da2a8451bd5671ee2aa0790012abe2e204eeaf14bd8492"
},
"downloads": -1,
"filename": "kreuzberg-3.8.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0d339d9b7c2449e85c21e0f4f1e53379",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 107565,
"upload_time": "2025-07-13T23:04:30",
"upload_time_iso_8601": "2025-07-13T23:04:30.278660Z",
"url": "https://files.pythonhosted.org/packages/a2/fa/b21c5d9b78a0f2f9bdba6ae966904a408e34fc6c56d684c8d93e560e9870/kreuzberg-3.8.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1a922ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098",
"md5": "cd43040eb5aec653933bfe72c7447f43",
"sha256": "0b925074b0d7b012d833ab9f53e874895b8c184e9a2c9b816f146d1d4b3b8dbd"
},
"downloads": -1,
"filename": "kreuzberg-3.8.2.tar.gz",
"has_sig": false,
"md5_digest": "cd43040eb5aec653933bfe72c7447f43",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 9828844,
"upload_time": "2025-07-13T23:04:31",
"upload_time_iso_8601": "2025-07-13T23:04:31.843440Z",
"url": "https://files.pythonhosted.org/packages/1a/92/2ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098/kreuzberg-3.8.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-13 23:04:31",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Goldziher",
"github_project": "kreuzberg",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kreuzberg"
}