# Kreuzberg
Kreuzberg is a Python library for text extraction from documents. It provides a unified async interface for extracting text from PDFs, images, office documents, and more.
## Why Kreuzberg?
- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- **Local Processing**: No external API calls or cloud dependencies required
- **Resource Efficient**: Lightweight processing without GPU requirements
- **Small Package Size**: Has few curated dependencies and a minimal footprint
- **Format Support**: Comprehensive support for documents, images, and text formats
- **Modern Python**: Built with async/await, type hints, and a functional-first approach
- **Permissive OSS**: Kreuzberg and its dependencies have a permissive OSS license
Kreuzberg was built for RAG (Retrieval Augmented Generation) applications, focusing on local processing with minimal dependencies. It is designed for modern async applications, serverless functions, and dockerized applications.
## Installation
### 1. Install the Python Package
```shell
pip install kreuzberg
```
### 2. Install System Dependencies
Kreuzberg requires two system-level dependencies:
- [Pandoc](https://pandoc.org/installing.html) - For document format conversion. Minimum required version is Pandoc 2.
- [Tesseract OCR](https://tesseract-ocr.github.io/) - For image and PDF OCR. Minimum required version is Tesseract 4.
You can install these with:
#### Linux (Ubuntu)
```shell
sudo apt-get install pandoc tesseract-ocr
```
#### macOS
```shell
brew install tesseract pandoc
```
#### Windows
```shell
choco install -y tesseract pandoc
```
Notes:
- In most distributions the tesseract-ocr package is split into multiple packages, so you may need to install language models other than English separately.
- Please consult the official documentation for these libraries for the most up-to-date installation instructions for your platform.
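After installing, it can help to confirm that both tools are reachable from Python. The following is a minimal sketch using only the standard library. The executable names `pandoc` and `tesseract` are assumptions based on the default package installs, and `find_missing_tools` is a hypothetical helper, not part of Kreuzberg.

```python
import shutil

# Executable names are assumptions based on the default package installs.
REQUIRED_TOOLS = ("pandoc", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print(f"Missing system dependencies: {', '.join(missing)}")
    else:
        print("Pandoc and Tesseract were found on PATH.")
```

This only checks that the executables exist on `PATH`. It does not verify the minimum versions listed above.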
## Architecture
Kreuzberg integrates:
- **PDF Processing**:
  - `pypdfium2` for searchable PDFs
  - Tesseract OCR for scanned content
- **Document Conversion**:
  - Pandoc for many document and markup formats
  - `python-pptx` for PowerPoint files
  - `html-to-markdown` for HTML content
  - `calamine` for Excel spreadsheets (with multi-sheet support)
- **Text Processing**:
  - Smart encoding detection
  - Markdown and plain text handling
### Supported Formats
#### Document Formats
- PDF (`.pdf`, both searchable and scanned)
- Microsoft Word (`.docx`)
- PowerPoint presentations (`.pptx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)
- EPUB (`.epub`)
- DocBook XML (`.dbk`, `.xml`)
- FictionBook (`.fb2`)
- LaTeX (`.tex`, `.latex`)
- Typst (`.typ`)
#### Markup and Text Formats
- HTML (`.html`, `.htm`)
- Plain text (`.txt`) and Markdown (`.md`, `.markdown`)
- reStructuredText (`.rst`)
- Org-mode (`.org`)
- DokuWiki (`.txt`)
- Pod (`.pod`)
- Troff/Man (`.1`, `.2`, etc.)
#### Data and Research Formats
- Spreadsheets (`.xlsx`, `.xls`, `.xlsm`, `.xlsb`, `.xlam`, `.xla`, `.ods`)
- CSV (`.csv`) and TSV (`.tsv`) files
- OPML files (`.opml`)
- Jupyter Notebooks (`.ipynb`)
- BibTeX (`.bib`) and BibLaTeX (`.bib`)
- CSL-JSON (`.json`)
- EndNote and JATS XML (`.xml`)
- RIS (`.ris`)
#### Image Formats
- JPEG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- JPEG 2000 family (`.jp2`, `.jpm`, `.jpx`, `.mj2`)
- WebP (`.webp`)
- Portable anymap formats (`.pbm`, `.pgm`, `.ppm`, `.pnm`)
## Usage
Kreuzberg provides both async and sync APIs for text extraction, including batch processing. The library exports the following main functions:
- Single Item Processing:
  - `extract_file()`: Async function to extract text from a file (accepts string path or `pathlib.Path`)
  - `extract_bytes()`: Async function to extract text from bytes (accepts a byte string)
  - `extract_file_sync()`: Synchronous version of `extract_file()`
  - `extract_bytes_sync()`: Synchronous version of `extract_bytes()`
- Batch Processing:
  - `batch_extract_file()`: Async function to extract text from multiple files concurrently
  - `batch_extract_bytes()`: Async function to extract text from multiple byte contents concurrently
  - `batch_extract_file_sync()`: Synchronous version of `batch_extract_file()`
  - `batch_extract_bytes_sync()`: Synchronous version of `batch_extract_bytes()`
### Configuration Parameters
All extraction functions accept the following optional parameters for configuring OCR and performance:
#### OCR Configuration
- `force_ocr` (default: `False`): Forces OCR processing even for searchable PDFs.
- `language` (default: `eng`): Specifies the language model for Tesseract OCR. This affects text recognition accuracy for documents in different languages. Examples:
  - `eng` for English
  - `deu` for German
  - `eng+deu` for English and German

  Note: the order of languages affects processing time. The first language is the primary language, the second is the secondary language, and so on.
- `psm` (Page Segmentation Mode, default: `PSMMode.AUTO`): Controls how Tesseract analyzes page layout. In most cases you do not need to change this value.

Consult the [Tesseract documentation](https://tesseract-ocr.github.io/tessdoc/) for more information on the `language` and `psm` options.
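Because the order of language codes matters, it can be convenient to build the `language` value from a list rather than by hand. The sketch below is an illustration only: `build_language_string` is a hypothetical helper, not a Kreuzberg function, and it simply joins codes with `+` while keeping the first occurrence of each.

```python
def build_language_string(languages):
    """Join Tesseract language codes with '+', keeping order and dropping duplicates."""
    seen = []
    for code in languages:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


# Primary language first, secondary language second
print(build_language_string(["eng", "deu"]))
```

The resulting string can then be passed as the `language` argument of the extraction functions.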
#### Processing Configuration
- `max_processes` (default: CPU count): Maximum number of concurrent processes for Tesseract.
### Quick Start
```python
from pathlib import Path
from kreuzberg import extract_file
from kreuzberg import ExtractionResult
from kreuzberg import PSMMode


# Basic file extraction
async def extract_document():
    # Extract from a PDF file with default settings
    pdf_result: ExtractionResult = await extract_file("document.pdf")
    print(f"Content: {pdf_result.content}")

    # Extract from an image with German language model
    img_result = await extract_file(
        "scan.png",
        language="deu",  # German language model
        psm=PSMMode.SINGLE_BLOCK,  # Treat as single block of text
        max_processes=4,  # Limit concurrent processes
    )
    print(f"Image text: {img_result.content}")

    # Extract from Word document with metadata
    docx_result = await extract_file(Path("document.docx"))
    if docx_result.metadata:
        print(f"Title: {docx_result.metadata.get('title')}")
        print(f"Author: {docx_result.metadata.get('creator')}")
```
### Extracting Bytes
```python
from kreuzberg import extract_bytes
from kreuzberg import ExtractionResult


async def process_upload(file_content: bytes, mime_type: str) -> ExtractionResult:
    """Process uploaded file content with known MIME type."""
    return await extract_bytes(
        file_content,
        mime_type=mime_type,
    )


# Example usage with different file types
async def handle_uploads(docx_bytes: bytes, pdf_bytes: bytes, image_bytes: bytes):
    # Process PDF upload
    pdf_result = await process_upload(pdf_bytes, mime_type="application/pdf")
    print(f"PDF content: {pdf_result.content}")
    print(f"PDF metadata: {pdf_result.metadata}")

    # Process image upload (will use OCR)
    img_result = await process_upload(image_bytes, mime_type="image/jpeg")
    print(f"Image text: {img_result.content}")

    # Process Word document upload
    docx_result = await process_upload(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
    print(f"Word content: {docx_result.content}")
```
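When the MIME type of an upload is not known in advance, one option is to guess it from the original filename before calling `extract_bytes()`. This is a minimal sketch using the standard-library `mimetypes` module. `guess_upload_mime_type` and its fallback value are hypothetical and not part of Kreuzberg.

```python
import mimetypes


def guess_upload_mime_type(filename, default="application/octet-stream"):
    """Guess a MIME type from a filename, falling back to a generic default."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


for name in ("report.pdf", "scan.png", "noextension"):
    print(name, "->", guess_upload_mime_type(name))
```

A guess based on the file extension can be wrong or missing. The Error Handling section lists unsupported or undetectable MIME types among the causes of `ValidationError`, so callers may want to handle that case explicitly.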
### Batch Processing
Kreuzberg supports efficient batch processing of multiple files or byte contents:
```python
from pathlib import Path
from kreuzberg import batch_extract_file, batch_extract_bytes, batch_extract_file_sync


# Process multiple files concurrently
async def process_documents(file_paths: list[Path]) -> None:
    # Extract from multiple files
    results = await batch_extract_file(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")


# Process multiple uploaded files concurrently
async def process_uploads(contents: list[tuple[bytes, str]]) -> None:
    # Each item is a tuple of (content, mime_type)
    results = await batch_extract_bytes(contents)
    for (_, mime_type), result in zip(contents, results):
        print(f"Upload {mime_type}: {result.content[:100]}...")


# Synchronous batch processing is also available
def process_documents_sync(file_paths: list[Path]) -> None:
    results = batch_extract_file_sync(file_paths)
    for path, result in zip(file_paths, results):
        print(f"File {path}: {result.content[:100]}...")
```
Features:
- Ordered results
- Concurrent processing
- Error handling per item
- Async and sync interfaces
- Same options as single extraction
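To illustrate what ordered, concurrent processing with per-item failures can look like in general, here is a self-contained `asyncio` sketch. It is not Kreuzberg's internal implementation: `fake_extract` is a stand-in for a real extraction call, and `extract_all` and `max_concurrency` are hypothetical names.

```python
import asyncio


async def fake_extract(name):
    """Stand-in for an extraction call; fails for names ending in '.bad'."""
    await asyncio.sleep(0)
    if name.endswith(".bad"):
        raise ValueError(f"cannot extract {name}")
    return f"text of {name}"


async def extract_all(names, max_concurrency=4):
    """Run extractions concurrently and return results in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name):
        async with semaphore:  # at most max_concurrency calls run at once
            return await fake_extract(name)

    # return_exceptions=True returns each failure in place instead of raising the first one
    return await asyncio.gather(*(guarded(n) for n in names), return_exceptions=True)


results = asyncio.run(extract_all(["a.pdf", "b.bad", "c.png"]))
for item in results:
    print("failed:" if isinstance(item, Exception) else "ok:", item)
```

`asyncio.gather` returns results in the order the awaitables were passed, which is why the output lines up with the input list regardless of which call finishes first.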
### PDF Processing
Kreuzberg employs a smart approach to PDF text extraction:
1. **Searchable Text Detection**: First attempts to extract text directly from searchable PDFs using `pypdfium2`.
2. **Text Validation**: Extracted text is validated for corruption by checking for:
   - Control and non-printable characters
   - Unicode replacement characters (�)
   - Zero-width spaces and other invisible characters
   - Empty or whitespace-only content
3. **Automatic OCR Fallback**: If the extracted text appears corrupted or if the PDF is image-based, Kreuzberg automatically falls back to OCR using Tesseract.
This approach works well for searchable PDFs and standard text documents. For complex OCR (e.g., handwriting, photographs), use a specialized tool.
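The validation checks listed above can be sketched as a small heuristic. This is an illustration under stated assumptions, not Kreuzberg's actual implementation: the function name `looks_corrupted`, the set of invisible characters, and the 5% threshold are all hypothetical choices.

```python
import unicodedata

# Characters treated as invisible in this sketch (an assumption, not the library's list).
INVISIBLE_CHARS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
REPLACEMENT_CHAR = "\ufffd"


def looks_corrupted(text, max_bad_ratio=0.05):
    """Heuristically decide whether extracted text should be re-read with OCR."""
    if not text or not text.strip():
        return True  # empty or whitespace-only content
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are fine
        if ch == REPLACEMENT_CHAR or ch in INVISIBLE_CHARS:
            bad += 1
        elif unicodedata.category(ch) in ("Cc", "Cn"):
            bad += 1  # control character or unassigned code point
    return bad / len(text) > max_bad_ratio


print(looks_corrupted("A normal sentence from a searchable PDF."))
print(looks_corrupted("\ufffd\ufffd\ufffdabc"))
```

A ratio-based threshold tolerates an occasional stray character in otherwise clean text, while a page dominated by replacement or control characters is flagged for OCR.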
### PDF Processing Options
You can control PDF processing behavior using optional parameters:
```python
from kreuzberg import extract_file


async def process_pdf():
    # Default behavior: auto-detect and use OCR if needed
    # max_processes is left at its default here
    result = await extract_file("document.pdf")
    print(result.content)

    # Force OCR even for searchable PDFs
    result = await extract_file("document.pdf", force_ocr=True)
    print(result.content)

    # Control OCR concurrency for large documents
    # Warning: High concurrency values can cause system resource exhaustion
    # Start with a low value and increase based on your system's capabilities
    result = await extract_file(
        "large_document.pdf",
        max_processes=4,  # Process up to 4 pages concurrently
    )
    print(result.content)

    # Process a scanned PDF (automatically uses OCR)
    result = await extract_file("scanned.pdf")
    print(result.content)
```
### ExtractionResult Object
All extraction functions return an `ExtractionResult` or a list thereof (for batch functions). The `ExtractionResult` object is a `NamedTuple`:
- `content`: The extracted text (str)
- `mime_type`: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
- `metadata`: A metadata dictionary. Currently this dictionary is only populated when documents are extracted using Pandoc.
```python
from kreuzberg import extract_file, ExtractionResult, Metadata


async def process_document(path: str) -> tuple[str, str, Metadata]:
    # Access as a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack as a tuple
    content, mime_type, metadata = await extract_file(path)
    return content, mime_type, metadata
```
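For readers unfamiliar with `NamedTuple`, the following stand-alone sketch shows the access patterns that such a result type supports. `SketchResult` is a hypothetical stand-in with the three fields described above. It is not the library's `ExtractionResult`, and the `dict[str, Any]` metadata type is an assumption.

```python
from typing import Any, NamedTuple


class SketchResult(NamedTuple):
    """Stand-in with the three fields described above."""

    content: str
    mime_type: str
    metadata: dict[str, Any]


result = SketchResult(" Hello ", "text/plain", {})

# Field access by name or by position
print(result.content, result[1])

# Tuple unpacking
content, mime_type, metadata = result

# Named tuples are immutable; _replace returns a modified copy
cleaned = result._replace(content=result.content.strip())
print(repr(cleaned.content))
```

Since named tuples are immutable, post-processing steps such as stripping whitespace produce a new object rather than changing the original.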
### Error Handling
Kreuzberg provides comprehensive error handling through several exception types, all inheriting from `KreuzbergError`. Each exception includes helpful context information for debugging.
```python
from kreuzberg import (
    extract_file,
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError,
)


async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation issues
        # - Unsupported or undetectable MIME types
        # - Missing files
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific issues
        # - Tesseract processing failures
        # - Image conversion problems
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency issues
        # - Missing Tesseract OCR
        # - Missing Pandoc
        # - Incompatible versions
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failures
        # - Format conversion issues
        # - Encoding problems
        print(f"Processing failed: {e}")

    return ""
```
All exceptions include:
- Error message
- Context in the `context` attribute
- String representation
- Exception chaining
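The pattern of a base exception that carries a `context` attribute and preserves the original cause can be sketched as follows. The classes `SketchError` and `SketchParsingError` and the function `parse_number` are hypothetical stand-ins for illustration. They are not Kreuzberg's classes, and the library's real constructor signatures may differ.

```python
class SketchError(Exception):
    """Base error that carries a context dictionary for debugging."""

    def __init__(self, message, context=None):
        super().__init__(message)
        self.context = context or {}

    def __str__(self):
        base = super().__str__()
        return f"{base} (context: {self.context})" if self.context else base


class SketchParsingError(SketchError):
    """Raised when input cannot be parsed."""


def parse_number(raw):
    try:
        return int(raw)
    except ValueError as exc:
        # "raise ... from" records the original error as __cause__
        raise SketchParsingError("could not parse value", context={"raw": raw}) from exc


try:
    parse_number("x")
except SketchError as err:
    print(err)
    print(type(err.__cause__).__name__)
```

Catching the base class handles every subclass in one place, while the `context` dictionary and the chained cause keep the details needed for debugging.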
## Contribution
This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss an issue before submitting a PR to avoid disappointment.
### Local Development
1. Clone the repo
2. Install the system dependencies
3. Install the full dependencies with `uv sync`
4. Install the pre-commit hooks with:
   ```shell
   pre-commit install && pre-commit install --hook-type commit-msg
   ```
5. Make your changes and submit a PR
## License
This library uses the MIT license.