# CV Document Chunker
A Python package for parsing PDF document layouts using YOLO models, chunking content based on layout, and optionally performing OCR.
## Features
- Convert PDF documents to images for processing.
- Detect document layout elements (e.g., paragraphs, tables, figures) using YOLO.
- Process and refine bounding boxes.
- Chunk document content based on detected layout.
- **(Optional)** Perform OCR on detected elements using Azure Document Intelligence.
- Save structured document data (layouts, chunks, OCR text) in JSON format.
- Generate paragraph embeddings using the OpenAI embedder.
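
To make the "process and refine bounding boxes" step more concrete, here is a minimal, generic sketch of one common approach: drop near-duplicate detections, then sort the survivors into reading order. It is not taken from the package's source. The `Box` class, the `refine` function, and the `0.8` overlap threshold are all hypothetical names and values chosen for illustration, and the package's actual refinement logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical layout element: pixel coordinates plus a label and a score."""
    x0: float
    y0: float
    x1: float
    y1: float
    label: str
    score: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller of the two boxes."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min(a.area, b.area)
    return (width * height) / smaller if smaller > 0 else 0.0


def refine(boxes: list[Box], threshold: float = 0.8) -> list[Box]:
    """Keep the highest-scoring box among heavily overlapping ones,
    then sort the survivors top-to-bottom, left-to-right."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(overlap_ratio(box, other) < threshold for other in kept):
            kept.append(box)
    return sorted(kept, key=lambda b: (b.y0, b.x0))


# Made-up detections for a single page (assumed values, not real model output).
detections = [
    Box(50, 300, 550, 500, "table", 0.88),
    Box(50, 40, 550, 120, "paragraph", 0.95),
    Box(55, 45, 545, 118, "paragraph", 0.60),  # near-duplicate of the box above
]
refined = refine(detections)
print([b.label for b in refined])
```

Dividing by the smaller box's area (rather than using plain IoU) also removes a small box nested inside a larger one, which is a frequent artifact of layout detectors.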
## Installation
### Prerequisites
- Python 3.10+
- pip package manager
- (Optional but recommended) CUDA-capable GPU to accelerate YOLO model inference.
### Steps
1. **Install the Package:**
```bash
pip install kiwi-pdf-chunker
```
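
After installing, a short script can confirm that the prerequisites above are met. The sketch below uses only the standard library plus an optional PyTorch import. The helper name `check_environment` is hypothetical, and the assumption that YOLO inference runs on PyTorch (so that `torch.cuda.is_available()` is the relevant GPU check) is mine rather than something this README states.

```python
import sys
from importlib import metadata


def check_environment(dist_name: str = "kiwi-pdf-chunker") -> dict[str, object]:
    """Report the Python version check, the installed package version, and CUDA status."""
    report: dict[str, object] = {"python_ok": sys.version_info >= (3, 10)}

    # Look up the installed distribution; None means it is not installed.
    try:
        report["package_version"] = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        report["package_version"] = None

    # Assumption: the YOLO backend is PyTorch. None means torch is not importable.
    try:
        import torch
        report["cuda_available"] = bool(torch.cuda.is_available())
    except ImportError:
        report["cuda_available"] = None

    return report


print(check_environment())
```

A `cuda_available` value of `False` or `None` does not block usage, since the GPU is optional. It only means inference would run on the CPU.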
## User-Provided Data
You must supply the following data yourself:
1. **Input Directory (`input/`):** Place the PDF documents you want to process in a directory (e.g., `input/`). You provide the path to your input file(s) when using the package.
2. **Models Directory (`models/`):** Download the necessary YOLO model(s) (e.g., `doclayout_yolo_docstructbench_imgsz1024.pt`) and place them in a dedicated directory (e.g., `models/`). The parser needs the path to this directory (or to the specific model file).
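
Because both directories are supplied by you, it can help to verify them before starting a run. The following sketch uses only `pathlib`. The function name `collect_inputs` is hypothetical, and the `input/` and `models/` locations are simply the example names used above. It does not call the package itself.

```python
from pathlib import Path


def collect_inputs(input_dir: Path, models_dir: Path) -> tuple[list[Path], list[Path]]:
    """Return sorted PDF paths and sorted .pt model paths, failing early if either is missing."""
    pdfs = sorted(p for p in input_dir.glob("*.pdf") if p.is_file())
    models = sorted(p for p in models_dir.glob("*.pt") if p.is_file())
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in {input_dir}")
    if not models:
        raise FileNotFoundError(f"No .pt model files found in {models_dir}")
    return pdfs, models


if __name__ == "__main__":
    try:
        pdfs, models = collect_inputs(Path("input"), Path("models"))
        print(f"Found {len(pdfs)} PDF(s) and {len(models)} model file(s).")
    except FileNotFoundError as err:
        print(f"Setup incomplete: {err}")
```

The returned paths can then be passed to the parser in whatever form it expects, for example a single PDF path and a single model file path.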
## Raw data
{
"_id": null,
"home_page": null,
"name": "kiwi-pdf-chunker",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "Vahan Martirosyan <vahan@kiwidata.com>",
"keywords": "pdf, ocr, parsing, document, layout, detection, chunking, cv",
"author": null,
"author_email": "Vahan Martirosyan / Kiwi Data <vahan@kiwidata.com>",
"download_url": "https://files.pythonhosted.org/packages/cd/71/e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba/kiwi_pdf_chunker-0.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A tool for parsing PDF document layouts and chunking content",
"version": "0.2.0",
"project_urls": {
"Homepage": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker",
"Issues": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker/issues",
"Repository": "https://github.com/NeolicenseVahan/kiwi-pdf-chunker"
},
"split_keywords": [
"pdf",
" ocr",
" parsing",
" document",
" layout",
" detection",
" chunking",
" cv"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e80f574269b434318f24e0df5cbac3d5831728074debf86edfbc69ed94fe1dc7",
"md5": "3dc08d8fc3042e015125c53e56534636",
"sha256": "0769144ce42e4a567c1459e1291027e7badc09d36b1e8cf0ef63b421df3e9b9b"
},
"downloads": -1,
"filename": "kiwi_pdf_chunker-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "3dc08d8fc3042e015125c53e56534636",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 27620,
"upload_time": "2025-08-19T18:25:51",
"upload_time_iso_8601": "2025-08-19T18:25:51.004887Z",
"url": "https://files.pythonhosted.org/packages/e8/0f/574269b434318f24e0df5cbac3d5831728074debf86edfbc69ed94fe1dc7/kiwi_pdf_chunker-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "cd71e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba",
"md5": "9684ee984e903a52072716e45d5eb0a7",
"sha256": "86eaf8682e91705debccd4e35394352ccc8c8ffeb06063526bd238c39c18ec28"
},
"downloads": -1,
"filename": "kiwi_pdf_chunker-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "9684ee984e903a52072716e45d5eb0a7",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 26542,
"upload_time": "2025-08-19T18:25:52",
"upload_time_iso_8601": "2025-08-19T18:25:52.221430Z",
"url": "https://files.pythonhosted.org/packages/cd/71/e9cb5fbcfbccdf61489133f6e40d9620e29bfcddd908248f1d0d07c73cba/kiwi_pdf_chunker-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-19 18:25:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "NeolicenseVahan",
"github_project": "kiwi-pdf-chunker",
"github_not_found": true,
"lcname": "kiwi-pdf-chunker"
}