kiwi-pdf-chunker

Name: kiwi-pdf-chunker
Version: 0.3.3
Summary: A tool for parsing PDF document layouts and chunking content
Homepage: https://github.com/Neo-License/cv_doc_chunker
Upload time: 2025-10-08 10:31:42
Requires Python: >=3.10
License: MIT
Keywords: pdf, ocr, parsing, document, layout, detection, chunking, cv
# CV Document Chunker

A Python package for parsing PDF document layouts with YOLO models, chunking content based on the detected layout, parsing tables via Docling integration, and optionally performing OCR.

## Features

- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- Intelligent table detection and parsing using Docling integration.
- Generate table summaries, identify key columns, and classify table content.
- Perform OCR on detected elements using Azure Document Intelligence or Tesseract.
- Save structured document data (layouts, chunks, OCR text, parsed tables) in JSON format.
- Generate paragraph embeddings using OpenAI, Azure OpenAI, or Hugging Face APIs.
- Build document hierarchy based on layout analysis.

## Installation

### Prerequisites

- Python 3.10+
- Pip package manager
- (Optional but Recommended) CUDA-capable GPU for YOLO model inference acceleration.

### Steps

1.  **Install the Package:**
    ```bash
    pip install kiwi-pdf-chunker
    ```

2. **Install models locally:**

You must download the required models **once** to your machine. The library will then automatically find them in the default location (`~/.kiwi_pdf_chunker/`).

First, run these shell commands to create the directories and download the `docling` models:

```bash
# Create directories for all models
mkdir -p ~/.kiwi_pdf_chunker/docling_models
mkdir -p ~/.kiwi_pdf_chunker/easyocr_models

# Download Docling layout model
python -m docling.cli.models download layout -o ~/.kiwi_pdf_chunker/docling_models

# Download Docling table structure model (TableFormer)
python -m docling.cli.models download tableformer -o ~/.kiwi_pdf_chunker/docling_models
```

Next, run this short Python script to download and cache the `EasyOCR` models:

```python
import easyocr
import os

# Define the target directory for EasyOCR models
target_dir = os.path.expanduser("~/.kiwi_pdf_chunker/easyocr_models")

# Initialize the reader, which will automatically download the models to the target directory
print("Downloading EasyOCR models...")
easyocr.Reader(
    ['en'],
    gpu=False,
    model_storage_directory=target_dir,
    download_enabled=True
)
print(f"EasyOCR models successfully downloaded to: {target_dir}")
```

You should end up with this directory structure in your home folder:

```
~/.kiwi_pdf_chunker/
├── docling_models/
│   └── ... (docling model files)
└── easyocr_models/
    ├── craft_mlt_25k.pth
    ├── english_g2.pth
    └── dict/
        └── en.txt
```

By default, the package looks for models in the `~/.kiwi_pdf_chunker/` directory. You can override these paths using environment variables if you need to store the models elsewhere:

```bash
export DOCLING_ARTIFACTS_PATH=/path/to/your/docling_models
export EASYOCR_MODULE_PATH=/path/to/your/easyocr_models
```
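As a quick sanity check, a short script like the following (not part of the package; the override-then-default resolution order is an assumption based on the variables above) prints where each model directory will be looked up:

```python
import os

# Default locations used by the package, overridable via environment variables
MODEL_DIRS = {
    "DOCLING_ARTIFACTS_PATH": "~/.kiwi_pdf_chunker/docling_models",
    "EASYOCR_MODULE_PATH": "~/.kiwi_pdf_chunker/easyocr_models",
}

def resolve_model_dir(env_var: str, default: str) -> str:
    """Return the env-var override if set, otherwise the expanded default path."""
    return os.path.expanduser(os.environ.get(env_var, default))

for var, default in MODEL_DIRS.items():
    path = resolve_model_dir(var, default)
    status = "found" if os.path.isdir(path) else "missing"
    print(f"{var} -> {path} ({status})")
```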

## User-Provided Data

This package requires the user to provide certain data externally:

1.  **Input Directory (`input/`):** Place the PDF documents you want to process in a directory (e.g., `input/`). You will need to provide the path to your input file(s) when using the package.
2.  **Models Directory (`models/`):** Download the necessary YOLO model(s) (e.g., `doclayout_yolo_docstructbench_imgsz1024.pt`) and place them in a dedicated directory (e.g., `models/`). The path to this directory (or the specific model file) will be needed by the parser.

## Usage

**Basic Usage:**

```python
from kiwi_pdf_chunker.main import PDFParser

# --- User Configuration ---
input_pdf_path = "path/to/your/input/document.pdf" # Path to user's PDF
model_path = "path/to/your/models/doclayout_yolo.pt" # Path to user's model
output_dir = "path/to/your/output/" # Directory to save results

# Basic parser with OCR
parser = PDFParser(
    yolo_model_path=model_path,
    ocr=True,
    azure_ocr_endpoint="your_azure_endpoint",
    azure_ocr_key="your_azure_key"
)

results = parser.parse_document(input_pdf_path, output_dir=output_dir)
```

**Advanced Usage with Table Classification and Embeddings:**

```python
from kiwi_pdf_chunker.main import PDFParser

# Advanced parser with table classification and embeddings
parser = PDFParser(
    yolo_model_path=model_path,
    ocr=True,
    azure_ocr_endpoint="your_azure_endpoint",
    azure_ocr_key="your_azure_key",
    embed=True,
    classify_tables=True,
    hf_token="your_huggingface_token",  # For embeddings
    hf_endpoint="your_huggingface_endpoint",
    azure_openai_api_key="your_azure_openai_key",  # For table classification
    azure_openai_api_version="2024-02-15-preview",
    azure_openai_endpoint="your_azure_openai_endpoint"
)

results = parser.parse_document(
    input_pdf_path, 
    output_dir=output_dir,
    generate_annotations=True,
    save_bounding_boxes=True,
    use_tesseract=False  # Use Azure OCR instead of Tesseract
)
```
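The snippets above hardcode credentials for brevity. In practice you may prefer to read them from environment variables; a small helper like this (the variable names are illustrative, not defined by the package) fails fast when a credential is missing:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, raising if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Illustrative variable names -- PDFParser just receives plain strings:
# parser = PDFParser(
#     yolo_model_path=model_path,
#     ocr=True,
#     azure_ocr_endpoint=require_env("AZURE_OCR_ENDPOINT"),
#     azure_ocr_key=require_env("AZURE_OCR_KEY"),
# )
```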

## Constructor Parameters

The `PDFParser` class accepts the following parameters:

### Core Parameters
- `yolo_model_path` (str, optional): Path to the YOLO model file. If None, uses the default path from config.
- `debug_mode` (bool, optional): Enable debug mode with additional logging and outputs. Defaults to False.
- `container_threshold` (int, optional): Minimum number of contained boxes required to remove a container box.
- `hierarchy` (bool, optional): Enable hierarchy generation. Defaults to True.

### OCR Parameters
- `ocr` (bool, optional): Enable OCR processing. Defaults to False.
- `azure_ocr_endpoint` (str, optional): Azure Document Intelligence endpoint URL.
- `azure_ocr_key` (str, optional): Azure Document Intelligence API key.

### Embedding Parameters
- `embed` (bool, optional): If True, generate embeddings for extracted text. Defaults to False.
- `embedding_model` (str, optional): Name of the OpenAI model for embeddings. Defaults to "text-embedding-3-small".
- `openai_api_key` (str, optional): API key for standard OpenAI service.
- `azure_openai_api_key` (str, optional): API key for Azure OpenAI service.
- `azure_openai_api_version` (str, optional): API version for Azure OpenAI service.
- `azure_openai_endpoint_embedding` (str, optional): Endpoint URL for Azure OpenAI service for text embeddings.
- `hf_token` (str, optional): Hugging Face API token for embeddings.
- `hf_endpoint` (str, optional): Hugging Face endpoint URL for embeddings.

### Table Classification Parameters
- `classify_tables` (bool, optional): If True, classify tables in the document. Defaults to False.
- `table_categories` (list, optional): List of table categories for classification.
- `azure_openai_endpoint` (str, optional): Endpoint URL for Azure OpenAI service for table classification.
- `table_classification_system_prompt` (str, optional): System prompt for table classification.
- `table_summary_system_prompt` (str, optional): System prompt for table summary generation.
- `table_id_column_system_prompt` (str, optional): System prompt for table ID column identification.

## Understanding the Output

After running the parser, the following outputs will typically be available in the specified `output_dir`:

1.  **`boxes.json`**: JSON file containing the detected document structure (element labels, coordinates, confidence).
2.  **`tables.json`**: JSON file containing parsed table data with hierarchical structure for "good" tables and vision model output for "bad" tables.
3.  **`table_screenshots/`**: Directory containing screenshots of detected tables for debugging and verification.
4.  **`annotations/`**: Directory containing annotated images showing the detected elements for each page (if `generate_annotations=True`).
5.  **`boxes/`**: Directory containing individual images for each detected element, organized by page number (if `save_bounding_boxes=True`). This is required for OCR.
6.  **`text.json`**: (Only if `ocr=True`) JSON file containing the extracted text for each element, sorted according to the structure defined in `boxes.json`.
7.  **`embeddings.json`**: (Only if `embed=True`) JSON file containing embeddings for each text element.
8.  **`hierarchy.json`**: (Only if `hierarchy=True`) JSON file containing the document hierarchy structure.
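For downstream processing, the JSON outputs can be loaded back with a few lines of standard-library code. A minimal sketch (the filenames come from the list above; the internal schema of each file is not reproduced here):

```python
import json
from pathlib import Path

def load_outputs(output_dir: str) -> dict:
    """Load whichever of the parser's JSON outputs exist in output_dir."""
    results = {}
    for name in ("boxes", "tables", "text", "embeddings", "hierarchy"):
        path = Path(output_dir) / f"{name}.json"
        if path.exists():
            results[name] = json.loads(path.read_text(encoding="utf-8"))
    return results
```

Optional outputs (`text.json`, `embeddings.json`, `hierarchy.json`) are simply absent from the returned dict when the corresponding feature was disabled.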

### Table Parsing Features

This package integrates [Docling](https://github.com/docling-project/docling) for PDF parsing.  
It is configured to run entirely offline, using the models downloaded to `~/.kiwi_pdf_chunker/` during installation.

The library provides advanced table parsing capabilities:

- **Automatic Table Detection**: Uses Docling integration for superior table detection.
- **Table Structure Classification**: Automatically classifies tables as "good" (hierarchically parseable) or "bad" (requiring vision models).
- **Table Parsing**: For "good" tables, builds hierarchical tree structures preserving table row relationships. For "bad" tables, uses vision models to extract structured data.
- **Table Content Analysis**: When `classify_tables=True`, provides:
  - Table content classification
  - Table summaries
  - Key column identification

If debug mode is enabled (`debug_mode=True`), additional debug images might be saved, typically in a `debug/` subdirectory within the `output_dir`, showing intermediate steps of the parsing process.

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

            
