kreuzberg


Name: kreuzberg
Version: 1.1.0
Summary: A text extraction library supporting PDFs, images, office documents and more
Upload time: 2025-02-01 17:20:54
Requires Python: >=3.9
License: MIT
Keywords: async, document-processing, docx, image-to-text, latex, markdown, ocr, odt, office-documents, pandoc, pdf, pdf-extraction, rag, tesseract, text-extraction, text-processing
# Kreuzberg

Kreuzberg is a library for simplified text extraction from PDFs, images, office documents and more. It is meant to offer
simple, hassle-free text extraction.

Why?

I am building, like many do now, a RAG-focused service, and I have text extraction needs.
There are quite a lot of commercial options out there, and several open-source + paid options.
But I wanted something simple, which does not require expensive round-trips to an external API.
Furthermore, I wanted something that is easy to run locally, is not very heavy, and does not require a GPU.

Hence, this library.

## Features

- Extract text from PDFs, images, office documents and more (see supported formats below)
- Use modern Python with async (via `anyio`) and proper type hints
- Extensive error handling for easy debugging

## Installation

1. Begin by installing the Python package:

   ```shell
   pip install kreuzberg
   ```

2. Install the system dependencies:

   - [pandoc](https://pandoc.org/installing.html) (non-PDF text extraction; GPL v2.0 licensed, but used via CLI only)
   - [tesseract-ocr](https://tesseract-ocr.github.io/) (image/PDF OCR; Apache License)

## Supported File Types

Kreuzberg supports a wide range of file formats:

### Document Formats

- PDF (`.pdf`) - both searchable and scanned documents
- Word Documents (`.docx`)
- OpenDocument Text (`.odt`)
- Rich Text Format (`.rtf`)

### Image Formats

- JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)
- PNG (`.png`)
- TIFF (`.tiff`, `.tif`)
- BMP (`.bmp`)
- GIF (`.gif`)
- WebP (`.webp`)
- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)
- Portable Anymap (`.pnm`)
- Portable Bitmap (`.pbm`)
- Portable Graymap (`.pgm`)
- Portable Pixmap (`.ppm`)

### Text and Markup Formats

- Plain Text (`.txt`)
- Markdown (`.md`)
- reStructuredText (`.rst`)
- LaTeX (`.tex`)

### Data Formats

- Comma-Separated Values (`.csv`)
- Tab-Separated Values (`.tsv`)

All formats support text extraction, with different processing methods:

- PDFs are processed using pypdfium2 for searchable PDFs and Tesseract OCR for scanned documents
- Images are processed using Tesseract OCR
- Office documents and other formats are processed using Pandoc
- Plain text files are read directly with appropriate encoding detection
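
The routing described above can be pictured as a small lookup on the file extension. This is only an illustrative sketch and not Kreuzberg's internal code: `pick_backend` and the backend labels are hypothetical names, and the extension set is copied from the image list above. Anything that is not a PDF, an image or a `.txt` file is assumed here to go to Pandoc.

```python
from pathlib import Path

# Extensions copied from the "Image Formats" list above.
IMAGE_SUFFIXES = {
    ".jpg", ".jpeg", ".pjpeg", ".png", ".tiff", ".tif", ".bmp", ".gif",
    ".webp", ".jp2", ".jpx", ".jpm", ".mj2", ".pnm", ".pbm", ".pgm", ".ppm",
}


def pick_backend(path: str) -> str:
    """Return a hypothetical label for the tool that would handle `path`."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"  # text layer for searchable PDFs, Tesseract OCR for scans
    if suffix in IMAGE_SUFFIXES:
        return "tesseract"
    if suffix == ".txt":
        return "plain-text"
    return "pandoc"  # assumption: every other supported format
```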

## Usage

Kreuzberg exports two async functions:

- Extract text from a file (string path or `pathlib.Path`) using `extract_file()`
- Extract text from a byte-string using `extract_bytes()`

Note - both of these functions are async and therefore should be used in an async context.
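
If the calling code is synchronous (a script or a CLI, for example), one way to get an async context is the standard library's `asyncio.run`. The sketch below is a minimal example under that assumption; `run_sync` and `main` are hypothetical helper names, not part of Kreuzberg.

```python
import asyncio
from typing import Any, Callable, Coroutine


def run_sync(func: Callable[..., Coroutine[Any, Any, Any]], *args: Any, **kwargs: Any) -> Any:
    """Run an async function to completion from synchronous code."""
    return asyncio.run(func(*args, **kwargs))


def main() -> None:
    # Imported lazily so that `run_sync` can be reused without Kreuzberg installed.
    from kreuzberg import extract_file

    result = run_sync(extract_file, "document.pdf")
    print(result.content)
```

Calling `main()` from a script entry point prints the text of `document.pdf`. Inside an already running event loop (a web framework handler, for instance), `await extract_file(...)` directly instead, since `asyncio.run` cannot be nested.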

### Extract from File

```python
from pathlib import Path
from kreuzberg import extract_file


# Extract text from a PDF file
async def extract_pdf():
    result = await extract_file("document.pdf")
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")


# Extract text from an image
async def extract_image():
    result = await extract_file("scan.png")
    print(f"Extracted text: {result.content}")


# Or pass a pathlib.Path instead of a string
async def extract_pdf_from_path():
    result = await extract_file(Path("document.pdf"))
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")
```
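
Because extraction is async, several files can be processed concurrently. The following is a sketch rather than a library feature: `extract_many` is a hypothetical helper that takes the extraction function as a parameter and uses a semaphore to cap how many extractions run at once (the default of 4 is an arbitrary assumption).

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_many(
    extract: Callable[[str], Awaitable[Any]],
    paths: Sequence[str],
    max_concurrency: int = 4,
) -> list[Any]:
    """Extract several files concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    # gather() returns results in the same order as `paths`.
    return await asyncio.gather(*(worker(p) for p in paths))
```

A call would look like `results = await extract_many(extract_file, ["document.pdf", "scan.png"])`.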

### Extract from Bytes

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf")
    return result.content


# Extract text from image bytes
async def process_uploaded_image(image_content: bytes):
    result = await extract_bytes(image_content, mime_type="image/jpeg")
    return result.content
```
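
When the bytes come from an upload, the mime type often has to be derived from the file name. A possible approach, sketched below, is the standard library's `mimetypes.guess_type`. The helpers `guess_mime_type` and `extract_upload` are hypothetical, and a guessed type is not guaranteed to be one that Kreuzberg supports.

```python
import mimetypes


def guess_mime_type(filename: str) -> str:
    """Guess a mime type from a file name, e.g. for an uploaded file."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"Cannot guess a mime type for {filename!r}")
    return mime_type


async def extract_upload(filename: str, data: bytes) -> str:
    # Imported lazily so that `guess_mime_type` works without Kreuzberg installed.
    from kreuzberg import extract_bytes

    result = await extract_bytes(data, mime_type=guess_mime_type(filename))
    return result.content
```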

### Forcing OCR

When extracting a PDF file or bytes, you might want to force OCR, for example when the PDF contains images with text that should also be extracted.
You can do this by passing `force_ocr=True`:

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes and force OCR
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
    return result.content
```
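
Since OCR is slower than reading an embedded text layer, it can make sense to force it only when normal extraction returns very little text. The sketch below shows one such heuristic. `extract_with_ocr_fallback` and the `min_chars` threshold are assumptions for illustration, not part of Kreuzberg.

```python
from typing import Any, Awaitable, Callable


async def extract_with_ocr_fallback(
    extract: Callable[..., Awaitable[Any]],
    source: Any,
    min_chars: int = 20,
    **kwargs: Any,
) -> Any:
    """Try normal extraction first and retry with OCR if little text came back."""
    result = await extract(source, **kwargs)
    if len(result.content.strip()) >= min_chars:
        return result
    # Heuristic: very little text may mean the pages are mostly images.
    return await extract(source, force_ocr=True, **kwargs)
```

For PDF bytes this could be called as `await extract_with_ocr_fallback(extract_bytes, pdf_content, mime_type="application/pdf")`.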

### Error Handling

Kreuzberg raises two exception types:

#### ValidationError

Raised when there are issues with input validation:

- Unsupported mime types
- Undetectable mime types
- The path does not point to an existing file

#### ParsingError

Raised when there are issues during the text extraction process:

- PDF parsing failures
- OCR errors
- Pandoc conversion errors

```python
from kreuzberg import extract_file
from kreuzberg.exceptions import ValidationError, ParsingError


async def safe_extract():
    try:
        result = await extract_file("document.doc")
        return result.content
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        print(f"Context: {e.context}")
    except ParsingError as e:
        print(f"Parsing error: {e.message}")
        print(f"Context: {e.context}")  # Contains detailed error information
```

Both error types include helpful context information for debugging:

```python
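from kreuzberg import extract_file
from kreuzberg.exceptions import ParsingError


async def extract_scanned():
    try:
        return await extract_file("scanned.pdf")
    except ParsingError as e:
        # e.context might contain:
        # {
        #    "file_path": "scanned.pdf",
        #    "error": "Tesseract OCR failed: Unable to process image"
        # }
        print(f"Context: {e.context}")
        raise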
```
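
For logging, the message and context can be folded into a single line. The helper below is a hypothetical sketch that relies only on the `message` and `context` attributes shown in the examples above, and it falls back gracefully for exceptions that lack them.

```python
import json
from typing import Any


def describe_error(error: Exception) -> str:
    """Build a one-line log message from an exception's message and context."""
    # `message` and `context` are the attributes used in the examples above;
    # getattr() keeps this helper safe for exceptions that do not have them.
    message = getattr(error, "message", str(error))
    context: Any = getattr(error, "context", None)
    if context is None:
        return f"{type(error).__name__}: {message}"
    return f"{type(error).__name__}: {message} | context={json.dumps(context, default=str)}"
```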

### ExtractionResult

All extraction functions return an `ExtractionResult` named tuple containing:

- `content`: The extracted text as a string
- `mime_type`: The mime type of the output (either "text/plain" or, if Pandoc is used, "text/markdown")

```python
from kreuzberg import ExtractionResult, extract_file


async def process_document(path: str) -> str:
    result: ExtractionResult = await extract_file(path)
    return result.content


# Or unpack the result as a tuple
async def process_document_unpacked(path: str) -> str:
    content, mime_type = await extract_file(path)
    # do something with mime_type
    return content
```
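
One practical use of `mime_type` is choosing a file suffix when saving the extracted text. The sketch below assumes only the two output mime types listed above. `save_extracted` and `SUFFIX_BY_MIME_TYPE` are hypothetical names, and unknown types fall back to `.txt`.

```python
from pathlib import Path

# The two output mime types listed above, mapped to a file suffix.
SUFFIX_BY_MIME_TYPE = {"text/plain": ".txt", "text/markdown": ".md"}


def save_extracted(content: str, mime_type: str, source: str, out_dir: str) -> Path:
    """Write extracted text into `out_dir`, choosing the suffix by mime type."""
    suffix = SUFFIX_BY_MIME_TYPE.get(mime_type, ".txt")
    target = Path(out_dir) / (Path(source).stem + suffix)
    target.write_text(content, encoding="utf-8")
    return target
```

With a result in hand, this would be called as `save_extracted(result.content, result.mime_type, path, "out")`, where the `out` directory is assumed to exist.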

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before
submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo
2. Install the system dependencies
3. Install the full dependencies with `uv sync`
4. Install the pre-commit hooks with:
   ```shell
   pre-commit install && pre-commit install --hook-type commit-msg
   ```
5. Make your changes and submit a PR

## License

This library uses the MIT license.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "kreuzberg",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "async, document-processing, docx, image-to-text, latex, markdown, ocr, odt, office-documents, pandoc, pdf, pdf-extraction, rag, tesseract, text-extraction, text-processing",
    "author": null,
    "author_email": "Na'aman Hirschfeld <nhirschfed@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/e8/f7/c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f/kreuzberg-1.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A text extraction library supporting PDFs, images, office documents and more",
    "version": "1.1.0",
    "project_urls": {
        "homepage": "https://github.com/Goldziher/kreuzberg"
    },
    "split_keywords": [
        "async",
        " document-processing",
        " docx",
        " image-to-text",
        " latex",
        " markdown",
        " ocr",
        " odt",
        " office-documents",
        " pandoc",
        " pdf",
        " pdf-extraction",
        " rag",
        " tesseract",
        " text-extraction",
        " text-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "aea4b9d00ee6f516322094937a4003910754997ea79f61ab33f0aa6ac3ee45e8",
                "md5": "93e86a934fdfa8d403dcbe5b3fa4380b",
                "sha256": "db0bf266a80cdf9be12f5187878a3a90b48cf7b156a8566157658a146d68acc6"
            },
            "downloads": -1,
            "filename": "kreuzberg-1.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "93e86a934fdfa8d403dcbe5b3fa4380b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 10122,
            "upload_time": "2025-02-01T17:20:49",
            "upload_time_iso_8601": "2025-02-01T17:20:49.745037Z",
            "url": "https://files.pythonhosted.org/packages/ae/a4/b9d00ee6f516322094937a4003910754997ea79f61ab33f0aa6ac3ee45e8/kreuzberg-1.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e8f7c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f",
                "md5": "f4eb1cc351fed32c4f6230fa1df09f95",
                "sha256": "74e9f6d0f1ba26d7865ef570f9c3ea3952edeb3138a549b0ae43e8a711d84f55"
            },
            "downloads": -1,
            "filename": "kreuzberg-1.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f4eb1cc351fed32c4f6230fa1df09f95",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 11738,
            "upload_time": "2025-02-01T17:20:54",
            "upload_time_iso_8601": "2025-02-01T17:20:54.815104Z",
            "url": "https://files.pythonhosted.org/packages/e8/f7/c646c43ed68e956fd1f565f489d8b39c4eae3ad8e57fb3d80d1bbe0a3d5f/kreuzberg-1.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-01 17:20:54",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Goldziher",
    "github_project": "kreuzberg",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "kreuzberg"
}
        