kiwi-pdf-chunker


- Name: kiwi-pdf-chunker
- Version: 0.2.0
- Summary: A tool for parsing PDF document layouts and chunking content
- Upload time: 2025-08-19 18:25:52
- Requires Python: >=3.10
- License: MIT
- Keywords: pdf, ocr, parsing, document, layout, detection, chunking, cv

# CV Document Chunker

A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.

## Features

- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- **(Optional)** Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Get paragraph embeddings using the OpenAI embedder.
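
To make the "process and refine bounding boxes" feature concrete, here is a generic sketch of one common refinement: drop detections that heavily overlap a higher-confidence detection (greedy non-maximum suppression), then sort the survivors top to bottom. This is not the package's own implementation. The `Box` type, the `iou` and `refine_boxes` functions, and the 0.5 threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detection: pixel corners, a layout label, a confidence score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    inter_h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_boxes(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Keep the highest-scoring box of each overlapping group, in reading order."""
    kept: list[Box] = []
    # Visit boxes from most to least confident; keep a box only if it does
    # not overlap an already-kept box by more than the threshold.
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, other) <= iou_threshold for other in kept):
            kept.append(box)
    # Simple single-column reading order: top edge first, then left edge.
    return sorted(kept, key=lambda b: (b.y0, b.x0))
```

Sorting by top edge is only adequate for single-column pages; multi-column layouts need a column-aware ordering.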

## Installation

### Prerequisites

- Python 3.10 or later
- The pip package manager
- (Optional but recommended) A CUDA-capable GPU to accelerate YOLO model inference.

### Steps

1.  **Install the package:**
    ```bash
    pip install kiwi-pdf-chunker
    ```
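
After installing, a quick check of the environment can catch a mismatched interpreter or a missing install. The sketch below uses only the standard library; the helper names `python_ok` and `installed_version` are hypothetical and not part of this package.

```python
from __future__ import annotations

import sys
from importlib import metadata


def python_ok(minimum: tuple[int, int] = (3, 10)) -> bool:
    """True if the running interpreter meets the package's Python requirement."""
    return sys.version_info >= minimum


def installed_version(dist_name: str = "kiwi-pdf-chunker") -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    print("kiwi-pdf-chunker version:", installed_version())
```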

## User-Provided Data

The package requires you to supply the following data yourself:

1.  **Input Directory (`input/`):** Place the PDF documents you want to process in a directory such as `input/`, and pass the path to your input file(s) when using the package.
2.  **Models Directory (`models/`):** Download the required YOLO model(s) (e.g., `doclayout_yolo_docstructbench_imgsz1024.pt`) and place them in a dedicated directory such as `models/`. The parser needs the path to this directory or to the specific model file.
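
Before running the parser, it can help to confirm that both locations exist and contain what you expect. The following is a minimal sketch using only `pathlib`; the `collect_inputs` helper is hypothetical, and the `input/` and `models/` paths simply mirror the examples above.

```python
from __future__ import annotations

from pathlib import Path


def collect_inputs(input_dir: str | Path, model_path: str | Path) -> list[Path]:
    """Validate the user-provided paths and return the PDFs found in input_dir."""
    input_dir = Path(input_dir)
    model_path = Path(model_path)

    if not input_dir.is_dir():
        raise FileNotFoundError(f"Input directory not found: {input_dir}")
    if not model_path.is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")

    # Match the extension case-insensitively so report.PDF is picked up too.
    pdfs = sorted(p for p in input_dir.iterdir() if p.suffix.lower() == ".pdf")
    if not pdfs:
        raise FileNotFoundError(f"No PDF files in {input_dir}")
    return pdfs


if __name__ == "__main__":
    # Example paths only; adjust them to your own layout.
    found = collect_inputs(
        "input", "models/doclayout_yolo_docstructbench_imgsz1024.pt"
    )
    for pdf in found:
        print(pdf)
```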

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "kiwi-pdf-chunker",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "Vahan Martirosyan <vahan@kiwidata.com>",
    "keywords": "pdf, ocr, parsing, document, layout, detection, chunking, cv",
    "author": null,
    "author_email": "Vahan Martirosyan / Kiwi Data <vahan@kiwidata.com>",
    "download_url": "https://files.pythonhosted.org/packages/cd/71/e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba/kiwi_pdf_chunker-0.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for parsing PDF document layouts and chunking content",
    "version": "0.2.0",
    "project_urls": {
        "Homepage": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker",
        "Issues": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker/issues",
        "Repository": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker"
    },
    "split_keywords": [
        "pdf",
        " ocr",
        " parsing",
        " document",
        " layout",
        " detection",
        " chunking",
        " cv"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e80f574269b434318f24e0df5cbac3d5831728074debf86edfbc69ed94fe1dc7",
                "md5": "3dc08d8fc3042e015125c53e56534636",
                "sha256": "0769144ce42e4a567c1459e1291027e7badc09d36b1e8cf0ef63b421df3e9b9b"
            },
            "downloads": -1,
            "filename": "kiwi_pdf_chunker-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3dc08d8fc3042e015125c53e56534636",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 27620,
            "upload_time": "2025-08-19T18:25:51",
            "upload_time_iso_8601": "2025-08-19T18:25:51.004887Z",
            "url": "https://files.pythonhosted.org/packages/e8/0f/574269b434318f24e0df5cbac3d5831728074debf86edfbc69ed94fe1dc7/kiwi_pdf_chunker-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "cd71e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba",
                "md5": "9684ee984e903a52072716e45d5eb0a7",
                "sha256": "86eaf8682e91705debccd4e35394352ccc8c8ffeb06063526bd238c39c18ec28"
            },
            "downloads": -1,
            "filename": "kiwi_pdf_chunker-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "9684ee984e903a52072716e45d5eb0a7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 26542,
            "upload_time": "2025-08-19T18:25:52",
            "upload_time_iso_8601": "2025-08-19T18:25:52.221430Z",
            "url": "https://files.pythonhosted.org/packages/cd/71/e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba/kiwi_pdf_chunker-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-19 18:25:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "NeolicenseVahan",
    "github_project": "kiwi-pdf-chunker",
    "github_not_found": true,
    "lcname": "kiwi-pdf-chunker"
}
        