kreuzberg


- **Name**: kreuzberg
- **Version**: 2.1.0
- **Summary**: A text extraction library supporting PDFs, images, office documents and more
- **Upload time**: 2025-02-20 15:26:43
- **Requires Python**: >=3.9
- **License**: MIT
- **Keywords**: document-processing, image-to-text, ocr, pandoc, pdf-extraction, rag, tesseract, text-extraction, text-processing

# Kreuzberg

Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.

## Why Kreuzberg?

- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- **Local Processing**: No external API calls or cloud dependencies required
- **Resource Efficient**: Lightweight processing without GPU requirements
- **Small Package Size**: Has few curated dependencies and a minimal footprint
- **Format Support**: Comprehensive support for documents, images, and text formats
- **Modern Python**: Built with async/await, type hints, and a functional-first approach
- **Permissive OSS**: Kreuzberg and its dependencies have a permissive OSS license

Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It is designed for modern async applications, serverless functions, and dockerized applications.

## Installation

### 1. Install the Python Package

```shell
pip install kreuzberg
```

### 2. Install System Dependencies

Kreuzberg requires two system-level dependencies:

- [Pandoc](https://pandoc.org/installing.html) - For document format conversion. Minimum required version is Pandoc 2.
- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR. Minimum required version is Tesseract 4.

You can install these with:

#### Linux (Ubuntu)

```shell
sudo apt-get install pandoc tesseract-ocr
```

#### macOS

```shell
brew install tesseract pandoc
```

#### Windows

```shell
choco install -y tesseract pandoc
```

Notes:

- In most distributions the `tesseract-ocr` package is split into multiple packages, so you may need to install language models other than English separately.
- Please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
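
Before running an extraction, it can help to confirm that both executables are on the `PATH`. The following is a minimal sketch using only the standard library. The helper name `find_system_dependencies` is hypothetical and not part of Kreuzberg, and it checks only for presence, not for the minimum versions listed above.

```python
import shutil


def find_system_dependencies(names=("pandoc", "tesseract")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_system_dependencies().items():
        print(f"{name}: {location or 'NOT FOUND on PATH'}")
```

A `None` value suggests the corresponding install step above has not been completed, or that the install location is not on the `PATH` of the current shell.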

## Architecture

Kreuzberg integrates:

- **PDF Processing**:
  - `pypdfium2` for searchable PDFs
  - Tesseract OCR for scanned content
- **Document Conversion**:
  - Pandoc for many document and markup formats
  - `python-pptx` for PowerPoint files
  - `html-to-markdown` for HTML content
  - `calamine` for Excel spreadsheets (with multi-sheet support)
- **Text Processing**:
  - Smart encoding detection
  - Markdown and plain text handling

### Supported Formats

#### Document Formats

- PDF (`.pdf`, both searchable and scanned)
- Microsoft Word (`.docx`)
- PowerPoint presentations (`.pptx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)
- EPUB (`.epub`)
- DocBook XML (`.dbk`, `.xml`)
- FictionBook (`.fb2`)
- LaTeX (`.tex`, `.latex`)
- Typst (`.typ`)

#### Markup and Text Formats

- HTML (`.html`, `.htm`)
- Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
- reStructuredText (`.rst`)
- Org-mode (`.org`)
- DokuWiki (`.txt`)
- Pod (`.pod`)
- Troff/Man (`.1`, `.2`, etc.)

#### Data and Research Formats

- Spreadsheets (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlam`, `.xla`, `.ods`)
- CSV (`.csv`) and TSV (`.tsv`) files
- OPML files (`.opml`)
- Jupyter Notebooks (`.ipynb`)
- BibTeX (`.bib`) and BibLaTeX (`.bib`)
- CSL-JSON (`.json`)
- EndNote and JATS XML (`.xml`)
- RIS (`.ris`)

#### Image Formats

- JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- JPEG 2000 family (`.jp2`, `.jpm`, `.jpx`, `.mj2`)
- WebP (`.webp`)
- Portable anymap formats (`.pbm`, `.pgm`, `.ppm`, `.pnm`)

## Usage

Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:

- Single Item Processing:

  - `extract_file()`: Async function to extract text from a file (accepts string path or `pathlib.Path`)
  - `extract_bytes()`: Async function to extract text from bytes (accepts a byte string)
  - `extract_file_sync()`: Synchronous version of `extract_file()`
  - `extract_bytes_sync()`: Synchronous version of `extract_bytes()`

- Batch Processing:
  - `batch_extract_file()`: Async function to extract text from multiple files concurrently
  - `batch_extract_bytes()`: Async function to extract text from multiple byte contents concurrently
  - `batch_extract_file_sync()`: Synchronous version of `batch_extract_file()`
  - `batch_extract_bytes_sync()`: Synchronous version of `batch_extract_bytes()`

### Configuration Parameters

All extraction functions accept the following optional parameters for configuring OCR and performance:

#### OCR Configuration

- `force_ocr` (default: `False`): Forces OCR processing even for searchable PDFs.
- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:

  - `eng` for English
  - `deu` for German
  - `eng+deu` for English and German

  Note: the order of languages affects processing time. The first language is the primary language, the second is the secondary language, and so on.

- `psm` (Page Segmentation Mode, default: `PSMMode.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this value.

Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on the `language` and `psm` options.
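
Because multi-language values such as `eng+deu` are plain strings with the primary language first, they can be assembled from a list of codes. This is a small sketch with a hypothetical helper, `build_language_code`, which is not part of Kreuzberg. It does not check that the corresponding Tesseract language models are installed.

```python
def build_language_code(*languages: str) -> str:
    """Join Tesseract language codes with '+', keeping the primary language first."""
    cleaned = [code.strip() for code in languages if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


print(build_language_code("eng"))         # eng
print(build_language_code("eng", "deu"))  # eng+deu
```

The result can be passed as the `language` argument of any extraction function.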

#### Processing Configuration

- `max_processes` (default: CPU count): Maximum number of concurrent processes for Tesseract.
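
On shared or memory-constrained hosts it may be preferable to cap the number of OCR processes instead of relying on the default. The sketch below picks a bounded value using only the standard library. The helper name `conservative_max_processes` and the cap of 4 are assumptions for illustration, not Kreuzberg defaults.

```python
import os


def conservative_max_processes(cap: int = 4) -> int:
    """Return a worker count between 1 and `cap`, bounded by the CPU count."""
    if cap < 1:
        raise ValueError("cap must be at least 1")
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return min(cap, cpu_count)


print(conservative_max_processes())
```

The returned integer can be passed as `max_processes=` to the extraction functions.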

### Quick Start

```python
from pathlib import Path
from kreuzberg import extract_file
from kreuzberg import ExtractionResult
from kreuzberg import PSMMode


# Basic file extraction
async def extract_document():
    # Extract from a PDF file with default settings
    pdf_result: ExtractionResult = await extract_file("document.pdf")
    print(f"Content: {pdf_result.content}")

    # Extract from an image with German language model
    img_result = await extract_file(
        "scan.png",
        language="deu",  # German language model
        psm=PSMMode.SINGLE_BLOCK,  # Treat as single block of text
        max_processes=4  # Limit concurrent processes
    )
    print(f"Image text: {img_result.content}")

    # Extract from Word document with metadata
    docx_result = await extract_file(Path("document.docx"))
    if docx_result.metadata:
        print(f"Title: {docx_result.metadata.get('title')}")
        print(f"Author: {docx_result.metadata.get('creator')}")
```

### Extracting Bytes

```python
from kreuzberg import extract_bytes
from kreuzberg import ExtractionResult


async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
    """Process uploaded file content with known MIME type."""
    return await extract_bytes(
        file_content,
        mime_type=mime_type,
    )


# Example usage with different file types
async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
    # Process PDF upload
    pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
    print(f"PDF content: {pdf_result.content}")
    print(f"PDF metadata: {pdf_result.metadata}")

    # Process image upload (will use OCR)
    img_result = await process_upload(image_bytes, mime_type="image/jpeg")
    print(f"Image text: {img_result.content}")

    # Process Word document upload
    docx_result = await process_upload(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    print(f"Word content: {docx_result.content}")
```
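
The examples above assume the MIME type is already known. When only a filename is available, one option is to guess the type from the extension with the standard library, as in this sketch. The helper name `guess_mime_type` is hypothetical. Note that the guess is based on the extension alone and does not inspect the file content.

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension; return None if it is unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


print(guess_mime_type("report.pdf"))  # application/pdf
print(guess_mime_type("scan.png"))    # image/png
```

A `None` result means the extension was not recognized, in which case the caller needs another source for the MIME type, such as the `Content-Type` of the upload.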

### Batch Processing

Kreuzberg supports efficient batch processing of multiple files or byte contents:

```python
from pathlib import Path
from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync


# Process multiple files concurrently
async def process_documents(file_paths: list[Path]) -> None:
    # Extract from multiple files
    results = await batch_extract_file(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")


# Process multiple uploaded files concurrently
async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
    # Each item is a tuple of (content, mime_type)
    results = await batch_extract_bytes(contents)
    for (_, mime_type), result in zip(contents, results):
        print(f"Upload {mime_type}: {result.content[:100]}...")


# Synchronous batch processing is also available
def process_documents_sync(file_paths: list[Path]) -> None:
    results = batch_extract_file_sync(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")
```

Features:

- Ordered results
- Concurrent processing
- Error handling per item
- Async and sync interfaces
- Same options as single extraction
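
A batch call needs a list of paths. The following sketch collects candidate files from a directory tree using only `pathlib`. The helper `collect_paths` and the constant `EXAMPLE_EXTENSIONS` are hypothetical names, and the extension set is only an illustrative subset of the supported formats listed earlier.

```python
from pathlib import Path

# Illustrative subset of the extensions listed under "Supported Formats".
EXAMPLE_EXTENSIONS = frozenset({".pdf", ".docx", ".pptx", ".png", ".jpg", ".md"})


def collect_paths(root, extensions=EXAMPLE_EXTENSIONS):
    """Return files under `root` (searched recursively) whose suffix is in `extensions`, sorted."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )


if __name__ == "__main__":
    for path in collect_paths("."):
        print(path)
```

The returned list can be passed to `batch_extract_file()` or `batch_extract_file_sync()` as shown above.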

### PDF Processing

Kreuzberg extracts text from PDFs in three steps:

1. **Searchable Text Detection**: First attempts to extract text directly from searchable PDFs using `pypdfium2`.

2. **Text Validation**: Extracted text is validated for corruption by checking for:

   - Control and non-printable characters
   - Unicode replacement characters (�)
   - Zero-width spaces and other invisible characters
   - Empty or whitespace-only content

3. **Automatic OCR Fallback**: If the extracted text appears corrupted or if the PDF is image-based, Kreuzberg automatically falls back to OCR using Tesseract.

This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.
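
To make the text validation step more concrete, here is a rough sketch of a corruption heuristic along the lines of the checks listed above. It is not Kreuzberg's actual implementation. The function name `looks_corrupted`, the set of invisible characters, and the 5% threshold are all assumptions chosen for illustration.

```python
import unicodedata

# Invisible code points treated as suspicious in this sketch (an assumption).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
REPLACEMENT = "\ufffd"  # Unicode replacement character


def looks_corrupted(text: str, max_bad_ratio: float = 0.05) -> bool:
    """Heuristically flag text that is empty or has too many suspicious characters."""
    if not text.strip():
        return True  # empty or whitespace-only content
    bad = 0
    for char in text:
        if char in INVISIBLE or char == REPLACEMENT:
            bad += 1
        elif unicodedata.category(char) == "Cc" and char not in "\n\r\t":
            bad += 1  # control character other than common whitespace
    return bad / len(text) > max_bad_ratio


print(looks_corrupted("Hello, world!"))    # False
print(looks_corrupted("\ufffd\ufffd\ufffd"))  # True
```

A caller following the three steps above would run OCR only when a check like this returns `True` for the directly extracted text.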

### PDF Processing Options

You can control PDF processing behavior using optional parameters:

```python
from kreuzberg import extract_file


async def process_pdf():
    # Default behavior: auto-detect and use OCR if needed
    result = await extract_file("document.pdf")
    print(result.content)

    # Force OCR even for searchable PDFs
    result = await extract_file("document.pdf", force_ocr=True)
    print(result.content)

    # Control OCR concurrency for large documents
    # Warning: High concurrency values can cause system resource exhaustion
    # Start with a low value and increase based on your system's capabilities
    result = await extract_file(
        "large_document.pdf",
        max_processes=4,  # Process up to 4 pages concurrently
    )
    print(result.content)

    # Process a scanned PDF (automatically uses OCR)
    result = await extract_file("scanned.pdf")
    print(result.content)
```

### ExtractionResult Object

All extraction functions return an `ExtractionResult` or a list thereof (for batch functions). The `ExtractionResult` object is a `NamedTuple`:

- `content`: The extracted text (str)
- `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
- `metadata`: A metadata dictionary. Currently this dictionary is only populated when extracting documents using Pandoc.

```python
from kreuzberg import extract_file, ExtractionResult, Metadata

async def process_document(path: str) -> tuple[str, str, Metadata]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type, metadata = await extract_file(path)
    return content, mime_type, metadata
```
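
For readers less familiar with `NamedTuple`, the sketch below shows the access patterns it allows, using a stand-in class with the three documented fields. `ExampleResult` is hypothetical and is not Kreuzberg's own class. The metadata type is simplified to a plain dictionary.

```python
from typing import Any, NamedTuple


class ExampleResult(NamedTuple):
    """Stand-in with the three documented fields; not Kreuzberg's own class."""

    content: str
    mime_type: str
    metadata: dict[str, Any]


result = ExampleResult("Hello", "text/plain", {})

print(result.content)  # attribute access: Hello
print(result[1])       # index access: text/plain

content, mime_type, metadata = result  # tuple unpacking
print(result._asdict()["mime_type"])   # conversion to a dict: text/plain
```

Like any tuple, a `NamedTuple` instance is immutable, so assigning to `result.content` raises an `AttributeError`.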

### Error Handling

Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.

```python
from kreuzberg import (
    extract_file,
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError
)

async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""
```

All exceptions include:

- Error message
- Context in the `context` attribute
- String representation
- Exception chaining
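
Since every exception type inherits from `KreuzbergError`, a caller that does not need to distinguish the cases can catch the base class alone. The sketch below assumes that `KreuzbergError` can be imported from the top-level `kreuzberg` package alongside `extract_file_sync`. That import path is an assumption, and the helper name `extract_or_none` is hypothetical.

```python
from typing import Optional


def extract_or_none(path: str) -> Optional[str]:
    """Return the extracted text, or None if any Kreuzberg error is raised."""
    # Imported lazily so this module loads even where Kreuzberg is absent.
    # Assumption: both names are exported from the top-level package.
    from kreuzberg import KreuzbergError, extract_file_sync

    try:
        return extract_file_sync(path).content
    except KreuzbergError as exc:
        # `context` is the attribute listed above; getattr guards against its absence.
        context = getattr(exc, "context", None)
        print(f"{type(exc).__name__}: {exc} (context: {context})")
        return None
```

Catching the base class keeps call sites short, at the cost of treating a missing system dependency the same way as a malformed input file.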

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo
2. Install the system dependencies
3. Install the full dependencies with `uv sync`
4. Install the pre-commit hooks with:

   ```shell
   pre-commit install && pre-commit install --hook-type commit-msg
   ```

5. Make your changes and submit a PR

## License

This library uses the MIT license.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "kreuzberg",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "document-processing, image-to-text, ocr, pandoc, pdf-extraction, rag, tesseract, text-extraction, text-processing",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/7a/2a/67cf078536ff58c3c46918e166e065896ace645da7c8716cb063f15f942e/kreuzberg-2.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A text extraction library supporting PDFs, images, office documents and more",
    "version": "2.1.0",
    "project_urls": {
        "homepage": "https://github.com/Goldziher/kreuzberg"
    },
    "split_keywords": [
        "document-processing",
        " image-to-text",
        " ocr",
        " pandoc",
        " pdf-extraction",
        " rag",
        " tesseract",
        " text-extraction",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "69bda8f077dc157c00a64e1e51b35c697e02983070b6526d7042dfc11cb74003",
                "md5": "54b59d126c16f5d7530ca038ae90660f",
                "sha256": "fc7ac3d0f47f10783e4c1232b048a34a6827320e14148a219f98bd438c076c61"
            },
            "downloads": -1,
            "filename": "kreuzberg-2.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "54b59d126c16f5d7530ca038ae90660f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 26902,
            "upload_time": "2025-02-20T15:26:39",
            "upload_time_iso_8601": "2025-02-20T15:26:39.086002Z",
            "url": "https://files.pythonhosted.org/packages/69/bd/a8f077dc157c00a64e1e51b35c697e02983070b6526d7042dfc11cb74003/kreuzberg-2.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7a2a67cf078536ff58c3c46918e166e065896ace645da7c8716cb063f15f942e",
                "md5": "ad03dc9654b966bc1eb8deba0140435d",
                "sha256": "2f472b1d9c1f753652ecb9c849a47fd1fa6f283631c95e484d731d414991ceba"
            },
            "downloads": -1,
            "filename": "kreuzberg-2.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ad03dc9654b966bc1eb8deba0140435d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 26300,
            "upload_time": "2025-02-20T15:26:43",
            "upload_time_iso_8601": "2025-02-20T15:26:43.054978Z",
            "url": "https://files.pythonhosted.org/packages/7a/2a/67cf078536ff58c3c46918e166e065896ace645da7c8716cb063f15f942e/kreuzberg-2.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-20 15:26:43",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "kreuzberg",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kreuzberg"
}
        