# Kreuzberg
Kreuzberg is a library for simplified text extraction from PDFs, images, office documents and more. It's meant to offer simple, hassle-free text
extraction.
Why?
Like many people these days, I am building a RAG-focused service, and I have text extraction needs.
There are quite a lot of commercial options out there, and several open-source + paid options.
But I wanted something simple that does not require expensive round-trips to an external API.
Furthermore, I wanted something that is easy to run locally, is not very heavy, and does not require a GPU.
Hence, this library.
## Features
- Extract text from PDFs, images, office documents and more (see supported formats below)
- Use modern Python with async (via `anyio`) and proper type hints
- Extensive error handling for easy debugging
## Installation
1. Begin by installing the Python package:
```shell
pip install kreuzberg
```
2. Install the system dependencies:
- [pandoc](https://pandoc.org/installing.html) (non-PDF text extraction, GPL v2.0 licensed but used via CLI only)
- [tesseract-ocr](https://tesseract-ocr.github.io/) (for image/PDF OCR, Apache License)
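
Before running extractions, it can help to confirm that both tools are reachable on `PATH`. The following is a minimal sketch of such a check using the standard library's `shutil.which`. The helper `missing_system_dependencies` is hypothetical and not part of Kreuzberg, and it assumes the executables are installed under their usual names, `pandoc` and `tesseract`.

```python
import shutil
from typing import Iterable, List


def missing_system_dependencies(
    commands: Iterable[str] = ("pandoc", "tesseract"),
) -> List[str]:
    """Return the commands that cannot be found on PATH.

    Hypothetical helper, not part of Kreuzberg. The default command names
    are an assumption about how the tools are installed.
    """
    return [command for command in commands if shutil.which(command) is None]


if __name__ == "__main__":
    missing = missing_system_dependencies()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("pandoc and tesseract were both found on PATH")
```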
## Supported File Types
Kreuzberg supports a wide range of file formats:
### Document Formats
- PDF (`.pdf`) - both searchable and scanned documents
- Word Documents (`.docx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)
### Image Formats
- JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- WebP (`.webp`)
- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
- Portable Anymap (`.pnm`)
- Portable Bitmap (`.pbm`)
- Portable Graymap (`.pgm`)
- Portable Pixmap (`.ppm`)
### Text and Markup Formats
- Plain Text (`.txt`)
- Markdown (`.md`)
- reStructuredText (`.rst`)
- LaTeX (`.tex`)
### Data Formats
- Comma-Separated Values (`.csv`)
- Tab-Separated Values (`.tsv`)
All formats support text extraction, with different processing methods:
- PDFs are processed using pypdfium2 for searchable PDFs and Tesseract OCR for scanned documents
- Images are processed using Tesseract OCR
- Office documents and other formats are processed using Pandoc
- Plain text files are read directly with appropriate encoding detection
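
The routing described above can be pictured as a lookup from file extension to processing method. The sketch below is only an illustration of that idea and is not Kreuzberg's actual implementation: the function `describe_backend` and the labels it returns are hypothetical, and placing the markup and data formats under Pandoc (rather than plain-text reading) is an assumption.

```python
from pathlib import Path

# Extension groups copied from the supported-formats lists above.
IMAGE_SUFFIXES = {
    ".jpg", ".jpeg", ".pjpeg", ".png", ".tiff", ".tif", ".bmp", ".gif",
    ".webp", ".jp2", ".jpx", ".jpm", ".mj2", ".pnm", ".pbm", ".pgm", ".ppm",
}
# Assumption: markup and data formats are grouped with office documents.
PANDOC_SUFFIXES = {".docx", ".odt", ".rtf", ".md", ".rst", ".tex", ".csv", ".tsv"}
PLAIN_TEXT_SUFFIXES = {".txt"}


def describe_backend(path: str) -> str:
    """Return a hypothetical label for the processing method of a file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf-text-or-ocr"
    if suffix in IMAGE_SUFFIXES:
        return "tesseract-ocr"
    if suffix in PANDOC_SUFFIXES:
        return "pandoc"
    if suffix in PLAIN_TEXT_SUFFIXES:
        return "plain-text"
    raise ValueError(f"Unsupported file extension: {suffix!r}")
```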
## Usage
Kreuzberg exports two async functions:
- Extract text from a file (string path or `pathlib.Path`) using `extract_file()`
- Extract text from a byte-string using `extract_bytes()`
Note: both functions are async, so they must be awaited from within an async context.
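
If the calling code is synchronous, one option is to drive the coroutine with `asyncio.run`. The sketch below assumes the default asyncio event loop and that no loop is already running in the current thread. `run_extraction` and `main` are hypothetical names used for illustration; Kreuzberg is imported lazily inside `main` so the helper itself has no hard dependency on it.

```python
import asyncio
from typing import Any, Callable, Coroutine


def run_extraction(
    extract: Callable[..., Coroutine[Any, Any, Any]], *args: Any, **kwargs: Any
) -> Any:
    """Run an async extraction function to completion from synchronous code."""
    return asyncio.run(extract(*args, **kwargs))


def main() -> None:
    # Imported here so that run_extraction can be reused and tested on its own.
    from kreuzberg import extract_file

    result = run_extraction(extract_file, "document.pdf")
    print(result.content)
```

Call `main()` from a script entry point. Inside an application that already runs an event loop, await `extract_file()` directly instead.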
### Extract from File
```python
from pathlib import Path

from kreuzberg import extract_file


# Extract text from a PDF file
async def extract_pdf():
    result = await extract_file("document.pdf")
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")


# Extract text from an image
async def extract_image():
    result = await extract_file("scan.png")
    print(f"Extracted text: {result.content}")


# Or pass a pathlib.Path instead of a string
async def extract_pdf_from_path():
    result = await extract_file(Path("document.pdf"))
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")
```
### Extract from Bytes
```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf")
    return result.content


# Extract text from image bytes
async def process_uploaded_image(image_content: bytes):
    result = await extract_bytes(image_content, mime_type="image/jpeg")
    return result.content
```
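
Because `extract_bytes()` needs an explicit `mime_type`, callers that only have a filename must derive one first. A minimal sketch using the standard library's `mimetypes` module follows. `guess_mime_type` and `extract_upload` are hypothetical helpers, not part of Kreuzberg, and guessing from the filename alone is an assumption that will not hold for misnamed files.

```python
import mimetypes


def guess_mime_type(filename: str) -> str:
    """Guess a MIME type from a filename, raising if none can be determined."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot determine MIME type for {filename!r}")
    return mime_type


async def extract_upload(filename: str, data: bytes) -> str:
    """Extract text from uploaded bytes, using the filename to pick the MIME type."""
    # Imported lazily so guess_mime_type can be used without Kreuzberg installed.
    from kreuzberg import extract_bytes

    result = await extract_bytes(data, mime_type=guess_mime_type(filename))
    return result.content
```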
### Forcing OCR
When extracting text from a PDF file or from PDF bytes, you might want to force OCR, for example when the PDF embeds images containing text that should also be extracted.
You can do this by passing `force_ocr=True`:
```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes and force OCR
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
    return result.content
```
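
Since OCR is slower than reading an embedded text layer, one possible pattern is to try a normal extraction first and retry with `force_ocr=True` only when little or no text comes back. The sketch below illustrates that pattern. `extract_with_ocr_fallback` and its `min_chars` threshold are hypothetical, and the extraction function is passed in as a parameter so the helper works with either `extract_file` or `extract_bytes`.

```python
from typing import Any, Awaitable, Callable


async def extract_with_ocr_fallback(
    extract: Callable[..., Awaitable[Any]],
    source: Any,
    min_chars: int = 1,
    **kwargs: Any,
) -> Any:
    """Extract normally, then retry with forced OCR if the text is too short.

    Hypothetical helper: `min_chars` is an arbitrary threshold, and `kwargs`
    must not already contain `force_ocr`.
    """
    result = await extract(source, **kwargs)
    if len(result.content.strip()) >= min_chars:
        return result
    # Too little text was found, so run the extraction again with OCR forced.
    return await extract(source, force_ocr=True, **kwargs)
```

For example, `await extract_with_ocr_fallback(extract_bytes, pdf_content, mime_type="application/pdf")` would only pay the OCR cost when the first pass returns an empty result.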
### Error Handling
Kreuzberg raises two exception types:
#### ValidationError
Raised when there are issues with input validation:
- Unsupported mime types
- Undetectable mime types
- The path does not point to an existing file
#### ParsingError
Raised when there are issues during the text extraction process:
- PDF parsing failures
- OCR errors
- Pandoc conversion errors
```python
from kreuzberg import extract_file
from kreuzberg.exceptions import ValidationError, ParsingError


async def safe_extract():
    try:
        result = await extract_file("document.doc")
        return result.content
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        print(f"Context: {e.context}")
    except ParsingError as e:
        print(f"Parsing error: {e.message}")
        print(f"Context: {e.context}")  # Contains detailed error information
```
Both error types include helpful context information for debugging:
```python
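from kreuzberg import extract_file
from kreuzberg.exceptions import ParsingError


async def inspect_parsing_error():
    try:
        await extract_file("scanned.pdf")
    except ParsingError as e:
        # e.context might contain:
        # {
        #     "file_path": "scanned.pdf",
        #     "error": "Tesseract OCR failed: Unable to process image"
        # }
        print(e.context)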
```
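
When processing several files at once, a single failure does not have to abort the whole batch. The sketch below shows one way to collect successes and errors separately using `asyncio.gather(..., return_exceptions=True)`. `extract_many` is a hypothetical helper, not part of Kreuzberg, and it assumes the asyncio backend; the extraction function is passed in as a parameter.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Sequence, Tuple


async def extract_many(
    extract: Callable[[str], Awaitable[Any]],
    paths: Sequence[str],
) -> Tuple[Dict[str, str], Dict[str, BaseException]]:
    """Extract all paths concurrently and split the outcomes by success."""
    outcomes = await asyncio.gather(
        *(extract(path) for path in paths), return_exceptions=True
    )
    succeeded: Dict[str, str] = {}
    failed: Dict[str, BaseException] = {}
    for path, outcome in zip(paths, outcomes):
        if isinstance(outcome, BaseException):
            # Keep the exception object so callers can inspect it later.
            failed[path] = outcome
        else:
            succeeded[path] = outcome.content
    return succeeded, failed
```

With Kreuzberg, this could be called as `await extract_many(extract_file, ["a.pdf", "b.docx"])`, after which each entry in `failed` can be checked against `ValidationError` or `ParsingError`.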
### ExtractionResult
All extraction functions return an `ExtractionResult` named tuple containing:
- `content`: The extracted text as a string
- `mime_type`: The mime type of the output (either "text/plain" or, if pandoc is used, "text/markdown")
```python
from kreuzberg import ExtractionResult, extract_file


async def process_document(path: str) -> str:
    result: ExtractionResult = await extract_file(path)
    return result.content


# Or unpack the result as a tuple
async def process_document_unpacked(path: str) -> str:
    content, mime_type = await extract_file(path)
    # do something with mime_type
    return content
```
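
One practical use of `mime_type` is deciding how to store the extracted text. The sketch below picks an output filename based on the two mime types listed above. `output_path_for` is a hypothetical helper, and the choice of the `.md` and `.txt` suffixes is an assumption.

```python
from pathlib import Path


def output_path_for(source: str, mime_type: str) -> Path:
    """Return the source path with a suffix matching the output mime type."""
    # "text/markdown" is reported when pandoc is used; anything else is
    # treated as plain text in this sketch.
    suffix = ".md" if mime_type == "text/markdown" else ".txt"
    return Path(source).with_suffix(suffix)
```

After `content, mime_type = await extract_file(path)`, the text could be saved with `output_path_for(path, mime_type).write_text(content)`.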
## Contribution
This library is open to contributions. Feel free to open issues or submit PRs. It's better to discuss issues before
submitting PRs to avoid disappointment.
### Local Development
1. Clone the repo
2. Install the system dependencies
3. Install the full dependencies with `uv sync`
4. Install the pre-commit hooks with:
```shell
pre-commit install && pre-commit install --hook-type commit-msg
```
5. Make your changes and submit a PR
## License
This library uses the MIT license.
Raw data
{
"_id": null,
"home_page": null,
"name": "kreuzberg",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "async, document-processing, docx, image-to-text, latex, markdown, ocr, odt, office-documents, pandoc, pdf, pdf-extraction, rag, tesseract, text-extraction, text-processing",
"author": null,
"author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/e8/f7/c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f/kreuzberg-1.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A text extraction library supporting PDFs, images, office documents and more",
"version": "1.1.0",
"project_urls": {
"homepage": "https://github.com/Goldziher/kreuzberg"
},
"split_keywords": [
"async",
" document-processing",
" docx",
" image-to-text",
" latex",
" markdown",
" ocr",
" odt",
" office-documents",
" pandoc",
" pdf",
" pdf-extraction",
" rag",
" tesseract",
" text-extraction",
" text-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "aea4b9d00ee6f516322094937a4003910754997ea79f61ab33f0aa6ac3ee45e8",
"md5": "93e86a934fdfa8d403dcbe5b3fa4380b",
"sha256": "db0bf266a80cdf9be12f5187878a3a90b48cf7b156a8566157658a146d68acc6"
},
"downloads": -1,
"filename": "kreuzberg-1.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "93e86a934fdfa8d403dcbe5b3fa4380b",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 10122,
"upload_time": "2025-02-01T17:20:49",
"upload_time_iso_8601": "2025-02-01T17:20:49.745037Z",
"url": "https://files.pythonhosted.org/packages/ae/a4/b9d00ee6f516322094937a4003910754997ea79f61ab33f0aa6ac3ee45e8/kreuzberg-1.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e8f7c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f",
"md5": "f4eb1cc351fed32c4f6230fa1df09f95",
"sha256": "74e9f6d0f1ba26d7865ef570f9c3ea3952edeb3138a549b0ae43e8a711d84f55"
},
"downloads": -1,
"filename": "kreuzberg-1.1.0.tar.gz",
"has_sig": false,
"md5_digest": "f4eb1cc351fed32c4f6230fa1df09f95",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 11738,
"upload_time": "2025-02-01T17:20:54",
"upload_time_iso_8601": "2025-02-01T17:20:54.815104Z",
"url": "https://files.pythonhosted.org/packages/e8/f7/c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f/kreuzberg-1.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-01 17:20:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Goldziher",
"github_project": "kreuzberg",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "kreuzberg"
}