| Field           | Value |
| --------------- | ----- |
| Name            | kreuzberg |
| Version         | 3.8.2 |
| Summary         | Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats |
| Upload time     | 2025-07-13 23:04:31 |
| Requires Python | >=3.10 |
| License         | MIT |
| Keywords        | async, document-analysis, document-intelligence, document-processing, extensible, information-extraction, mcp, metadata-extraction, model-context-protocol, ocr, pandoc, pdf-extraction, pdfium, plugin-architecture, rag, retrieval-augmented-generation, structured-data, table-extraction, tesseract, text-extraction |

# Kreuzberg

[![Discord](https://img.shields.io/badge/Discord-Join%20our%20community-7289da)](https://discord.gg/pXxagNK2zN)
[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![Documentation](https://img.shields.io/badge/docs-kreuzberg.dev-blue)](https://kreuzberg.dev/)
[![Benchmarks](https://img.shields.io/badge/benchmarks-fastest%20CPU-orange)](https://benchmarks.kreuzberg.dev/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Test Coverage](https://img.shields.io/badge/coverage-95%25-green)](https://github.com/Goldziher/kreuzberg)

**A document intelligence framework for Python.** Extract text, metadata, and structured information from diverse document formats through a unified, extensible API. Built on established open source foundations including Pandoc, PDFium, and Tesseract.

📖 **[Complete Documentation](https://kreuzberg.dev/)**

## Framework Overview

### Document Intelligence Capabilities

- **Text Extraction**: High-fidelity text extraction preserving document structure and formatting
- **Metadata Extraction**: Comprehensive metadata including author, creation date, language, and document properties
- **Format Support**: 18 document types including PDF, Microsoft Office, images, HTML, and structured data formats
- **OCR Integration**: Multiple OCR engines (Tesseract, EasyOCR, PaddleOCR) with automatic fallback
- **Table Detection**: Structured table extraction with cell-level precision via GMFT integration

### Technical Architecture

- **Performance**: Highest throughput among the benchmarked Python document processing frameworks (about 32 files/second on tiny files with the sync API)
- **Resource Efficiency**: 71 MB installation, ~360 MB runtime memory footprint (sync API)
- **Extensibility**: Plugin architecture for custom extractors via the Extractor base class
- **API Design**: Synchronous and asynchronous APIs with consistent interfaces
- **Type Safety**: Complete type annotations throughout the codebase

### Open Source Foundation

Kreuzberg leverages established open source technologies:

- **Pandoc**: Universal document converter for robust format support
- **PDFium**: Google's PDF rendering engine for accurate PDF processing
- **Tesseract**: Open source OCR engine for text recognition (originally developed at HP, later sponsored by Google)
- **python-docx / python-pptx**: Native Microsoft Office format support

## Quick Start

### Extract Text with CLI

```bash
# Extract text from any file to markdown
uvx kreuzberg extract document.pdf > output.md

# With all features (OCR, table extraction, etc.)
uvx --from "kreuzberg[all]" kreuzberg extract invoice.pdf --ocr --format markdown

# Extract with rich metadata
uvx kreuzberg extract report.pdf --show-metadata --format json
```

### Python Usage

**Async (recommended for web apps):**

```python
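import asyncio

from kreuzberg import extract_file


async def main() -> None:
    # extract_file is a coroutine, so it has to be awaited inside an async function.
    # In a web framework, await it directly in your request handler instead.
    result = await extract_file("presentation.pptx")
    print(result.content)

    # Rich metadata extraction
    print(f"Title: {result.metadata.title}")
    print(f"Author: {result.metadata.author}")
    print(f"Page count: {result.metadata.page_count}")
    print(f"Created: {result.metadata.created_at}")


asyncio.run(main())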
```

**Sync (for scripts and CLI tools):**

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("report.docx")
print(result.content)

# Access rich metadata
print(f"Language: {result.metadata.language}")
print(f"Word count: {result.metadata.word_count}")
print(f"Keywords: {result.metadata.keywords}")
```
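
When many files need processing, the async API can be combined with standard `asyncio` tools. The sketch below is one possible way to do that, not a Kreuzberg feature: `extract_many` and its `limit` parameter are hypothetical names, and the extraction coroutine is passed in as an argument so the helper does not depend on any particular library signature beyond "takes a path, returns an awaitable".

```python
import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Any


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 4,
) -> list[Any]:
    """Run `extract` over `paths` with at most `limit` calls in flight.

    Results come back in the same order as `paths`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Any:
        # The semaphore caps concurrency so large batches do not exhaust memory.
        async with semaphore:
            return await extract(path)

    return await asyncio.gather(*(worker(p) for p in paths))


# Usage with Kreuzberg (assumes the package is installed):
#
#     from kreuzberg import extract_file
#     results = asyncio.run(extract_many(["a.pdf", "b.docx"], extract_file))
```

The library may ship its own batch helpers; check the API reference before relying on a hand-rolled one like this.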

### Docker

```bash
# Run the REST API
docker run -p 8000:8000 goldziher/kreuzberg

# Extract via API
curl -X POST -F "file=@document.pdf" http://localhost:8000/extract
```
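
The same upload can be made from Python without extra dependencies. This is a rough sketch that mirrors the `curl` call above: the `/extract` URL and the `file` form field come from that command, while `build_multipart` and `post_file` are hypothetical helper names. The response format is not assumed here, so the raw bytes are returned as-is.

```python
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def post_file(path: str, url: str = "http://localhost:8000/extract") -> bytes:
    """POST the file at `path` to the running container; return the raw response body."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `post_file("document.pdf")` while the container is running should be equivalent to the `curl` example.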

📖 **[Installation Guide](https://kreuzberg.dev/getting-started/installation/)** • **[CLI Documentation](https://kreuzberg.dev/cli/)** • **[API Reference](https://kreuzberg.dev/api-reference/)**

## Deployment Options

### 🤖 MCP Server (AI Integration)

**Add with one command (uses the `claude` CLI that ships with Claude Code):**

```bash
claude mcp add kreuzberg uvx -- --from "kreuzberg[all]" kreuzberg-mcp
```

**Or configure manually in `claude_desktop_config.json`:**

```json
{
  "mcpServers": {
    "kreuzberg": {
      "command": "uvx",
      "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"]
    }
  }
}
```
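
If the config file already lists other servers, the entry has to be merged rather than pasted over the file. A small sketch of that merge follows, assuming the file has the `mcpServers` layout shown above; `add_kreuzberg_server` and `update_config_file` are hypothetical helper names, and the location of `claude_desktop_config.json` varies by platform, so it is passed in by the caller.

```python
import copy
import json
from pathlib import Path

# Same entry as in the JSON snippet above.
KREUZBERG_ENTRY = {
    "command": "uvx",
    "args": ["--from", "kreuzberg[all]", "kreuzberg-mcp"],
}


def add_kreuzberg_server(config: dict) -> dict:
    """Return a copy of `config` with the kreuzberg MCP entry added or replaced."""
    servers = dict(config.get("mcpServers", {}))
    servers["kreuzberg"] = copy.deepcopy(KREUZBERG_ENTRY)
    return {**config, "mcpServers": servers}


def update_config_file(path: Path) -> None:
    """Rewrite the JSON config at `path`, keeping all other settings intact."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = add_kreuzberg_server(config)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")
```

Back up the existing file before running `update_config_file` on it.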

**MCP capabilities:**

- Extract text from PDFs, images, Office docs, and more
- Full OCR support with multiple engines
- Table extraction and metadata parsing

📖 **[MCP Documentation](https://kreuzberg.dev/user-guide/mcp-server/)**

## Supported Formats

| Category          | Formats                        |
| ----------------- | ------------------------------ |
| **Documents**     | PDF, DOCX, DOC, RTF, TXT, EPUB |
| **Images**        | JPG, PNG, TIFF, BMP, GIF, WEBP |
| **Spreadsheets**  | XLSX, XLS, CSV, ODS            |
| **Presentations** | PPTX, PPT, ODP                 |
| **Web**           | HTML, XML, MHTML               |
| **Archives**      | Support via extraction         |

## 📊 Performance Characteristics

[View comprehensive benchmarks](https://benchmarks.kreuzberg.dev/) • [Benchmark methodology](https://github.com/Goldziher/python-text-extraction-libs-benchmarks) • [**Detailed Analysis**](https://kreuzberg.dev/performance-analysis/)

### Technical Specifications

| Metric                       | Kreuzberg Sync | Kreuzberg Async | Benchmark result   |
| ---------------------------- | -------------- | --------------- | ------------------ |
| **Throughput (tiny files)**  | 31.78 files/s  | 23.94 files/s   | Highest throughput |
| **Throughput (small files)** | 8.91 files/s   | 9.31 files/s    | Highest throughput |
| **Memory footprint**         | 359.8 MB       | 395.2 MB        | Lowest usage       |
| **Installation size**        | 71 MB          | 71 MB           | Smallest size      |
| **Success rate**             | 100%           | 100%            | Perfect            |
| **Supported formats**        | 18             | 18              | Comprehensive      |
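
The sync and async columns can be compared directly with a few lines of arithmetic. The snippet below only reuses the figures from the table; the dictionary keys and function names are made up for illustration.

```python
# Figures copied from the Technical Specifications table above.
SYNC = {"tiny_files_per_s": 31.78, "small_files_per_s": 8.91, "memory_mb": 359.8}
ASYNC = {"tiny_files_per_s": 23.94, "small_files_per_s": 9.31, "memory_mb": 395.2}


def ratio(metric: str) -> float:
    """Sync value divided by async value for one metric."""
    return SYNC[metric] / ASYNC[metric]


def ms_per_file(files_per_s: float) -> float:
    """Convert a throughput in files/s to average milliseconds per file."""
    return 1000.0 / files_per_s


for metric in SYNC:
    print(f"{metric}: sync/async = {ratio(metric):.2f}")
print(f"tiny files, sync: {ms_per_file(SYNC['tiny_files_per_s']):.1f} ms per file")
```

Read this way, the sync API has roughly a third more throughput on tiny files, the async API is about 4% faster on small files, and the async API uses about 10% more memory.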

### Architecture Advantages

- **Native C extensions**: Built on PDFium and Tesseract for maximum performance
- **Async/await support**: True asynchronous processing with intelligent task scheduling
- **Memory efficiency**: Streaming architecture minimizes memory allocation
- **Process pooling**: Automatic multiprocessing for CPU-intensive operations
- **Optimized data flow**: Efficient data handling with minimal transformations

> **Benchmark details**: Tests include PDFs, Word docs, HTML, images, and spreadsheets in multiple languages (English, Hebrew, German, Chinese, Japanese, Korean) on standardized hardware.

## Documentation

### Quick Links

- [Installation Guide](https://kreuzberg.dev/getting-started/installation/) - Setup and dependencies
- [User Guide](https://kreuzberg.dev/user-guide/) - Comprehensive usage guide
- [Performance Analysis](https://kreuzberg.dev/performance-analysis/) - Detailed benchmark results
- [API Reference](https://kreuzberg.dev/api-reference/) - Complete API documentation
- [Docker Guide](https://kreuzberg.dev/user-guide/docker/) - Container deployment
- [REST API](https://kreuzberg.dev/user-guide/api-server/) - HTTP endpoints
- [CLI Guide](https://kreuzberg.dev/cli/) - Command-line usage
- [OCR Configuration](https://kreuzberg.dev/user-guide/ocr-configuration/) - OCR engine setup

## License

MIT License - see [LICENSE](LICENSE) for details.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "kreuzberg",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "async, document-analysis, document-intelligence, document-processing, extensible, information-extraction, mcp, metadata-extraction, model-context-protocol, ocr, pandoc, pdf-extraction, pdfium, plugin-architecture, rag, retrieval-augmented-generation, structured-data, table-extraction, tesseract, text-extraction",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/1a/92/2ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098/kreuzberg-3.8.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Document intelligence framework for Python - Extract text, metadata, and structured data from diverse file formats",
    "version": "3.8.2",
    "project_urls": {
        "documentation": "https://kreuzberg.dev",
        "homepage": "https://github.com/Goldziher/kreuzberg"
    },
    "split_keywords": [
        "async",
        " document-analysis",
        " document-intelligence",
        " document-processing",
        " extensible",
        " information-extraction",
        " mcp",
        " metadata-extraction",
        " model-context-protocol",
        " ocr",
        " pandoc",
        " pdf-extraction",
        " pdfium",
        " plugin-architecture",
        " rag",
        " retrieval-augmented-generation",
        " structured-data",
        " table-extraction",
        " tesseract",
        " text-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a2fab21c5d9b78a0f2f9bdba6ae966904a408e34fc6c56d684c8d93e560e9870",
                "md5": "0d339d9b7c2449e85c21e0f4f1e53379",
                "sha256": "370b436713c3c03b37da2a8451bd5671ee2aa0790012abe2e204eeaf14bd8492"
            },
            "downloads": -1,
            "filename": "kreuzberg-3.8.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0d339d9b7c2449e85c21e0f4f1e53379",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 107565,
            "upload_time": "2025-07-13T23:04:30",
            "upload_time_iso_8601": "2025-07-13T23:04:30.278660Z",
            "url": "https://files.pythonhosted.org/packages/a2/fa/b21c5d9b78a0f2f9bdba6ae966904a408e34fc6c56d684c8d93e560e9870/kreuzberg-3.8.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1a922ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098",
                "md5": "cd43040eb5aec653933bfe72c7447f43",
                "sha256": "0b925074b0d7b012d833ab9f53e874895b8c184e9a2c9b816f146d1d4b3b8dbd"
            },
            "downloads": -1,
            "filename": "kreuzberg-3.8.2.tar.gz",
            "has_sig": false,
            "md5_digest": "cd43040eb5aec653933bfe72c7447f43",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 9828844,
            "upload_time": "2025-07-13T23:04:31",
            "upload_time_iso_8601": "2025-07-13T23:04:31.843440Z",
            "url": "https://files.pythonhosted.org/packages/1a/92/2ebbdc01cb74640a495269f71ae577197aecbedf62d27e8f5ea1f064f098/kreuzberg-3.8.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-13 23:04:31",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "kreuzberg",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kreuzberg"
}