kreuzberg


- **Name**: kreuzberg
- **Version**: 3.6.2
- **Summary**: A text extraction library supporting PDFs, images, office documents and more
- **Upload time**: 2025-07-11 07:21:57
- **Requires Python**: `>=3.10`
- **License**: MIT
- **Keywords**: document-processing, entity-extraction, image-to-text, keyword-extraction, named-entity-recognition, ner, ocr, pandoc, pdf-extraction, rag, spacy, table-extraction, tesseract, text-extraction, text-processing

# Kreuzberg

[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**High-performance Python library for text extraction from documents.** Extract text from PDFs, images, office documents, and more with both async and sync APIs.

📖 **[Complete Documentation](https://goldziher.github.io/kreuzberg/)**

## Why Kreuzberg?

- **🚀 Fastest Performance**: [35+ files/second](https://goldziher.github.io/python-text-extraction-libs-benchmarks/) - the fastest text extraction library in the linked benchmarks
- **💾 Memory Efficient**: Install size up to 14x smaller than alternatives (71MB vs 1GB+), with the lowest memory usage (~530MB)
- **⚡ Dual APIs**: Only library with both sync and async support
- **🔧 Zero Configuration**: Works out of the box with sane defaults
- **🏠 Local Processing**: No cloud dependencies or external API calls
- **📦 Rich Format Support**: PDFs, images, Office docs, HTML, and more
- **🔍 Multiple OCR Engines**: Tesseract, EasyOCR, and PaddleOCR support
- **🐳 Production Ready**: CLI, REST API, and Docker images included

## Quick Start

### Installation

```bash
# Basic installation
pip install kreuzberg

# With optional features
pip install "kreuzberg[cli,api]"        # CLI + REST API
pip install "kreuzberg[easyocr,gmft]"   # EasyOCR + table extraction
pip install "kreuzberg[all]"            # Everything
```

### System Dependencies

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install tesseract pandoc
```
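
Before running extractions, it can help to confirm that the system tools installed above are actually visible on `PATH`. The following is a minimal sketch using only the standard library; the tool names are taken from the install commands above, and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Executable names taken from the install commands above.
REQUIRED_TOOLS = ("tesseract", "pandoc")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```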

### Basic Usage

```python
import asyncio
from kreuzberg import extract_file

async def main():
    # Extract from any document type
    result = await extract_file("document.pdf")
    print(result.content)
    print(result.metadata)

asyncio.run(main())
```
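
For scripts that do not run an event loop, the sync API may be more convenient. The sketch below assumes the blocking counterpart is exported as `extract_file_sync`; confirm the name in the API Reference before relying on it. The `extractor` parameter is a hypothetical hook added here only so the function can be exercised without Kreuzberg installed.

```python
def extract_text(path, extractor=None):
    """Return the extracted text of `path` using a blocking call."""
    if extractor is None:
        # Assumption: Kreuzberg exports a sync counterpart under this name.
        from kreuzberg import extract_file_sync

        extractor = extract_file_sync
    result = extractor(path)
    # `content` is the same attribute used in the async example above.
    return result.content


# Usage (requires Kreuzberg and its system dependencies):
#     print(extract_text("document.pdf"))
```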

## Deployment Options

### 🐳 Docker (Recommended)

```bash
# Run API server
docker run -p 8000:8000 goldziher/kreuzberg:latest

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@document.pdf"
```

Available variants: `latest`, `3.6.1`, `3.6.1-easyocr`, `3.6.1-paddle`, `3.6.1-gmft`, `3.6.1-all`

### 🌐 REST API

```bash
# Install and run
pip install "kreuzberg[api]"
litestar --app kreuzberg._api.main:app run

# Health check
curl http://localhost:8000/health

# Extract files
curl -X POST http://localhost:8000/extract -F "data=@file.pdf"
```
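
The same upload can be made from Python without extra dependencies. This is a sketch, not an official client: the endpoint path and the `data` form field come from the `curl` command above, while the assumption that the server answers with a JSON body, and the helper names, are mine.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, payload, boundary=None):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_via_api(path, base_url="http://localhost:8000"):
    """POST a file to /extract and decode the response (assumed to be JSON)."""
    path = Path(path)
    body, content_type = build_multipart("data", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Usage (requires a running server):
#     print(extract_via_api("file.pdf"))
```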

### 💻 Command Line

```bash
# Install CLI
pip install "kreuzberg[cli]"

# Extract to stdout
kreuzberg extract document.pdf

# JSON output with metadata
kreuzberg extract document.pdf --output-format json --show-metadata

# Batch processing
kreuzberg extract *.pdf --output-dir ./extracted/
```
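
The CLI can also be driven from a Python script. The sketch below only uses the flags shown above; the helper names are hypothetical, and the shape of the JSON printed by the CLI is not documented here, so the result is returned as parsed without assuming any keys.

```python
import json
import subprocess


def build_extract_command(path, as_json=True, show_metadata=True):
    """Build the argument list for `kreuzberg extract` using the flags shown above."""
    command = ["kreuzberg", "extract", str(path)]
    if as_json:
        command += ["--output-format", "json"]
    if show_metadata:
        command.append("--show-metadata")
    return command


def run_extract(path):
    """Run the CLI and parse its stdout as JSON; raises if the command fails."""
    completed = subprocess.run(
        build_extract_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


# Usage (requires `pip install "kreuzberg[cli]"`):
#     print(run_extract("document.pdf"))
```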

## Supported Formats

| Category          | Formats                        |
| ----------------- | ------------------------------ |
| **Documents**     | PDF, DOCX, DOC, RTF, TXT, EPUB |
| **Images**        | JPG, PNG, TIFF, BMP, GIF, WEBP |
| **Spreadsheets**  | XLSX, XLS, CSV, ODS            |
| **Presentations** | PPTX, PPT, ODP                 |
| **Web**           | HTML, XML, MHTML               |
| **Archives**      | Support via extraction         |

## Performance

**[Comprehensive benchmarks](https://goldziher.github.io/python-text-extraction-libs-benchmarks/)** across 94 real-world documents (~210MB) • [View source](https://github.com/Goldziher/python-text-extraction-libs-benchmarks):

| Library       | Speed           | Memory    | Install Size | Dependencies | Success Rate |
| ------------- | --------------- | --------- | ------------ | ------------ | ------------ |
| **Kreuzberg** | **35+ files/s** | **530MB** | **71MB**     | **20**       | High\*       |
| Unstructured  | Moderate        | ~1GB      | 146MB        | 54           | 88%+         |
| MarkItDown    | Good†           | ~1.5GB    | 251MB        | 25           | 80%†         |
| Docling       | 60+ min/file‡   | ~5GB      | 1,032MB      | 88           | Low‡         |

\*_Can achieve 75% reliability with 15% performance trade-off when configured_

†_Good on simple documents, struggles with large/complex files (>10MB)_

‡_Frequently fails/times out on medium files (>1MB)_

> **Benchmark details**: Tested across PDFs, Word docs, HTML, images, spreadsheets in 6 languages (English, Hebrew, German, Chinese, Japanese, Korean)
>
> **Rule of thumb**: Use async API for complex documents and batch processing (up to 4.5x faster)
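
One way to apply the rule of thumb above is to fan out async extractions with a concurrency cap. This is a minimal sketch built on `asyncio` and the `extract_file` coroutine from Basic Usage; the `extract_many` helper, its `extractor` hook, and the default limit of 8 are illustrative assumptions, not part of Kreuzberg.

```python
import asyncio


async def extract_many(paths, extractor=None, max_concurrency=8):
    """Extract several files concurrently, at most `max_concurrency` at a time.

    Results come back in input order; a failed file yields its exception
    instead of aborting the whole batch.
    """
    if extractor is None:
        # Imported lazily so the helper can be tested with a stand-in coroutine.
        from kreuzberg import extract_file

        extractor = extract_file

    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path):
        async with semaphore:
            return await extractor(path)

    return await asyncio.gather(*(worker(p) for p in paths), return_exceptions=True)


# Usage (requires Kreuzberg):
#     results = asyncio.run(extract_many(["a.pdf", "b.docx"]))
```

The cap matters mostly for OCR-heavy inputs, where unbounded concurrency can exhaust memory; a suitable value depends on the machine and the documents.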

## Documentation

### Quick Links

- [Installation Guide](https://goldziher.github.io/kreuzberg/getting-started/installation/) - Setup and dependencies
- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - Comprehensive usage guide
- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Complete API documentation
- [Docker Guide](https://goldziher.github.io/kreuzberg/user-guide/docker/) - Container deployment
- [REST API](https://goldziher.github.io/kreuzberg/user-guide/api-server/) - HTTP endpoints
- [CLI Guide](https://goldziher.github.io/kreuzberg/cli/) - Command-line usage
- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - OCR engine setup

## Advanced Features

- **📊 Table Extraction**: Extract tables from PDFs with GMFT
- **🧩 Content Chunking**: Split documents for RAG applications
- **🎯 Custom Extractors**: Extend with your own document handlers
- **🔧 Configuration**: Flexible TOML-based configuration
- **🪝 Hooks**: Pre/post-processing customization
- **🌍 Multi-language OCR**: 100+ languages supported
- **⚙️ Metadata Extraction**: Rich document metadata
- **🔄 Batch Processing**: Efficient bulk document processing
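
To illustrate what content chunking for RAG means in practice, here is a generic fixed-size chunker with overlap. It is not Kreuzberg's implementation, and the function and parameter names are hypothetical; see the User Guide for the library's own chunking options.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split `text` into chunks of at most `max_chars`, each sharing
    `overlap` characters with the previous chunk."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    step = max_chars - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current chunk reaches the end of the text, so no
        # trailing chunk is emitted that only repeats the overlap.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.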

## License

MIT License - see [LICENSE](LICENSE) for details.

______________________________________________________________________

<div align="center">

**[Documentation](https://goldziher.github.io/kreuzberg/) • [PyPI](https://pypi.org/project/kreuzberg/) • [Docker Hub](https://hub.docker.com/r/goldziher/kreuzberg) • [Benchmarks](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [Discord](https://discord.gg/pXxagNK2zN)**

Made with ❤️ by the [Kreuzberg contributors](https://github.com/Goldziher/kreuzberg/graphs/contributors)

</div>

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "kreuzberg",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "document-processing, entity-extraction, image-to-text, keyword-extraction, named-entity-recognition, ner, ocr, pandoc, pdf-extraction, rag, spacy, table-extraction, tesseract, text-extraction, text-processing",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/0c/a2/5b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee/kreuzberg-3.6.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A text extraction library supporting PDFs, images, office documents and more",
    "version": "3.6.2",
    "project_urls": {
        "homepage": "https://github.com/Goldziher/kreuzberg"
    },
    "split_keywords": [
        "document-processing",
        " entity-extraction",
        " image-to-text",
        " keyword-extraction",
        " named-entity-recognition",
        " ner",
        " ocr",
        " pandoc",
        " pdf-extraction",
        " rag",
        " spacy",
        " table-extraction",
        " tesseract",
        " text-extraction",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "689d80702fabb1fd451e183cb52c48b5535b3eeb19b7a6b4f66e3d8e61e8a227",
                "md5": "02881e050138370139251ea8e01da13b",
                "sha256": "36bba10f46a0347993c50c08c0f6c731bf528cf507e669781a7cb3eecb1daec0"
            },
            "downloads": -1,
            "filename": "kreuzberg-3.6.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "02881e050138370139251ea8e01da13b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 95190,
            "upload_time": "2025-07-11T07:21:56",
            "upload_time_iso_8601": "2025-07-11T07:21:56.259517Z",
            "url": "https://files.pythonhosted.org/packages/68/9d/80702fabb1fd451e183cb52c48b5535b3eeb19b7a6b4f66e3d8e61e8a227/kreuzberg-3.6.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "0ca25b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee",
                "md5": "0223cb856900b324f298edc67d8241d4",
                "sha256": "4865e4a0be85e72867791dc19bbb8a7c8c4853900b59cb4cbef5bfaecafa1626"
            },
            "downloads": -1,
            "filename": "kreuzberg-3.6.2.tar.gz",
            "has_sig": false,
            "md5_digest": "0223cb856900b324f298edc67d8241d4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 9389554,
            "upload_time": "2025-07-11T07:21:57",
            "upload_time_iso_8601": "2025-07-11T07:21:57.560903Z",
            "url": "https://files.pythonhosted.org/packages/0c/a2/5b53128503967432369fca403f86fc6687d987f7ee13058ce47f1d60caee/kreuzberg-3.6.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-11 07:21:57",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "kreuzberg",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kreuzberg"
}
        