# atai-pdf-tool
A command-line tool for parsing and extracting text from PDF files with OCR capabilities and performance optimization options.
## Installation
```bash
pip install atai-pdf-tool
```
## Usage
### Command Line Interface
#### Basic Usage with Default Settings
```bash
atai-pdf-tool path/to/your/document.pdf -o output.json
```
#### Parallel Processing (Faster for Multi-core Systems)
```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --parallel --max-workers 4
```
#### Lower DPI for Faster Processing
```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --dpi 150
```
#### Batch Processing for Large PDFs (Memory-Efficient)
```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --batch --batch-size 10
```
#### OCR-Only Mode with Parallel Processing and GPU Acceleration
```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --ocr-only --parallel --gpu
```
#### Process Specific Page Range with Optimizations
```bash
atai-pdf-tool path/to/your/document.pdf -s 5 -e 15 -o output.json --parallel --dpi 180 --gpu
```
### Options
- `-s`, `--start-page`: Starting page number (0-indexed, default: 0)
- `-e`, `--end-page`: Ending page number (0-indexed, default: last page)
- `-o`, `--output`: Output JSON file path (if not provided, prints to stdout)
- `--ocr-only`: Use OCR for all pages regardless of extractable text
- `-l`, `--lang`: Language for OCR processing (default: en)
- `--parallel`: Enable parallel processing for faster performance (multi-core systems)
- `--max-workers`: Control the number of parallel workers for processing
- `--dpi`: Control image resolution for OCR (lower DPI is faster but may reduce recognition accuracy)
- `--batch`: Use memory-efficient batch processing for large PDFs
- `--batch-size`: Control the batch size for batch processing
- `--ocr-threshold`: Set the threshold for when to fall back to OCR
- `--gpu`: Enable GPU acceleration for OCR processing
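
When the tool is driven from a script, the options above can be assembled into an argument list and passed to `subprocess`. The following is a minimal sketch, not part of the package: `build_command` and `run_tool` are hypothetical helper names, and only flags documented in this README are used.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    lang: Optional[str] = None,
    ocr_only: bool = False,
    parallel: bool = False,
    max_workers: Optional[int] = None,
    dpi: Optional[int] = None,
    gpu: bool = False,
) -> List[str]:
    """Build an argv list for atai-pdf-tool (hypothetical helper)."""
    cmd = ["atai-pdf-tool", pdf_path]
    # Compare against None so that page 0 is still passed through.
    if start_page is not None:
        cmd += ["-s", str(start_page)]
    if end_page is not None:
        cmd += ["-e", str(end_page)]
    if output is not None:
        cmd += ["-o", output]
    if lang is not None:
        cmd += ["-l", lang]
    if ocr_only:
        cmd.append("--ocr-only")
    if parallel:
        cmd.append("--parallel")
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    if gpu:
        cmd.append("--gpu")
    return cmd


def run_tool(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


# Example: the same invocation as the parallel-processing command above.
command = build_command(
    "path/to/your/document.pdf",
    output="output.json",
    parallel=True,
    max_workers=4,
)
# run_tool(command)  # uncomment once atai-pdf-tool is installed
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.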
### Supported Languages
The language option (`-l`, `--lang`) accepts language codes supported by EasyOCR. Some common ones include:
- `en`: English
- `ch_sim`: Simplified Chinese
- `ch_tra`: Traditional Chinese
- `fr`: French
- `de`: German
- `ja`: Japanese
- `ko`: Korean
- `es`: Spanish
For a complete list of language codes, see the [EasyOCR documentation](https://www.jaided.ai/easyocr/).
### As a Python Module
```python
from atai_pdf_tool.parser import extract_pdf_pages, ocr_pdf, save_as_json
# Extract text from specific pages with English OCR
text = extract_pdf_pages("document.pdf", start_page=0, end_page=5, lang="en")
# Extract text with different language
chinese_text = extract_pdf_pages("chinese_document.pdf", lang="ch_sim")
# Extract without progress bar
text = extract_pdf_pages("document.pdf", show_progress=False)
# Save to JSON
save_as_json(text, "output.json")
# OCR an entire PDF with a specific language
french_ocr_text = ocr_pdf("french_document.pdf", lang="fr")
```
## Key Improvements and Performance Enhancements
- **Parallel Processing**: Use multiple CPU cores for faster processing of large PDFs.
- **DPI Control**: Adjust the resolution for OCR processing to balance speed and quality (`--dpi`).
- **Batch Processing**: Process large PDFs in memory-efficient batches (`--batch`, `--batch-size`).
- **GPU Acceleration**: Leverage GPU resources for OCR processing (`--gpu`).
- **OCR Threshold**: Set a configurable threshold for when to switch to OCR processing (`--ocr-threshold`).
- **Reused OCR Reader**: Optimized OCR integration for better speed, especially with multi-page documents.
These options let you tune extraction to your hardware: parallel workers and GPU acceleration for speed, a lower DPI to trade OCR quality for speed, and batch processing to reduce memory use.
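
To make the batch-processing and OCR-threshold ideas concrete, here is a minimal sketch of how they could work. It does not reproduce the package's internals: `page_batches` and `needs_ocr` are hypothetical names, the inclusive page range is an assumption, and measuring the threshold as a character count is also an assumption (the README does not say what unit `--ocr-threshold` uses).

```python
from typing import Iterator, Tuple


def page_batches(
    start_page: int, end_page: int, batch_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) page ranges covering start_page..end_page.

    Assumption: both page numbers are 0-indexed and inclusive.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    first = start_page
    while first <= end_page:
        last = min(first + batch_size - 1, end_page)
        yield first, last
        first = last + 1


def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Decide whether a page should fall back to OCR.

    Assumption: a page whose embedded text is shorter than min_chars
    (after stripping whitespace) is treated as a scanned image.
    """
    return len(extracted_text.strip()) < min_chars


# A 25-page document processed 10 pages at a time.
for first, last in page_batches(0, 24, 10):
    print(f"processing pages {first}..{last}")
```

Processing one range at a time means only the images for that batch need to be held in memory, which is the usual motivation for batching large documents.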
## License
This project is licensed under the MIT License - see the LICENSE file for details.
---
Raw data
{
"_id": null,
"home_page": "https://github.com/AtomGradient/atai-pdf-tool",
"name": "atai-pdf-tool",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": null,
"keywords": "pdf, ocr, text-extraction, document-processing",
"author": "AtomGradient",
"author_email": "AtomGradient <alex@atomgradient.com>",
"download_url": "https://files.pythonhosted.org/packages/d8/5e/e7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549/atai_pdf_tool-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A tool for parsing and extracting text from PDF files with OCR capabilities",
"version": "0.1.0",
"project_urls": {
"Bug Tracker": "https://github.com/AtomGradient/atai-pdf-tool/issues",
"Homepage": "https://github.com/AtomGradient/atai-pdf-tool"
},
"split_keywords": [
"pdf",
" ocr",
" text-extraction",
" document-processing"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "527a80ea11d4af3dcfb07f98ff53ae607ce8147aee9a1958a9a4b5b607a3c132",
"md5": "1328f9fc26c146570475b1a382bc6180",
"sha256": "37f77501052ef93c98bdcd665e23c9030efc57da318908d80800d10f933dc98d"
},
"downloads": -1,
"filename": "atai_pdf_tool-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1328f9fc26c146570475b1a382bc6180",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 8783,
"upload_time": "2025-02-27T11:15:43",
"upload_time_iso_8601": "2025-02-27T11:15:43.646208Z",
"url": "https://files.pythonhosted.org/packages/52/7a/80ea11d4af3dcfb07f98ff53ae607ce8147aee9a1958a9a4b5b607a3c132/atai_pdf_tool-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d85ee7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549",
"md5": "5e35022b2db07fa9dd436db13785f72c",
"sha256": "03ff0b7a7ad60755bb4884b31924b6388e11578deb4b0c8b2f6e577e531f61da"
},
"downloads": -1,
"filename": "atai_pdf_tool-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "5e35022b2db07fa9dd436db13785f72c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 9461,
"upload_time": "2025-02-27T11:15:46",
"upload_time_iso_8601": "2025-02-27T11:15:46.378215Z",
"url": "https://files.pythonhosted.org/packages/d8/5e/e7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549/atai_pdf_tool-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-02-27 11:15:46",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "AtomGradient",
"github_project": "atai-pdf-tool",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "PyPDF2",
"specs": []
},
{
"name": "PyMuPDF",
"specs": []
},
{
"name": "easyocr",
"specs": []
},
{
"name": "tqdm",
"specs": []
}
],
"lcname": "atai-pdf-tool"
}