atai-pdf-tool


| Field | Value |
| --- | --- |
| Name | atai-pdf-tool |
| Version | 0.1.0 |
| Home page | https://github.com/AtomGradient/atai-pdf-tool |
| Summary | A tool for parsing and extracting text from PDF files with OCR capabilities |
| Upload time | 2025-02-27 11:15:46 |
| Maintainer | None |
| Docs URL | None |
| Author | AtomGradient |
| Requires Python | >=3.6 |
| License | MIT |
| Keywords | pdf, ocr, text-extraction, document-processing |
| Requirements | PyPDF2, PyMuPDF, easyocr, tqdm |
| Travis CI | Not configured |
| Coveralls test coverage | Not configured |

# atai-pdf-tool

A command-line tool for parsing and extracting text from PDF files with OCR capabilities and performance optimization options.

## Installation

```bash
pip install atai-pdf-tool
```

## Usage

### Command Line Interface

#### Basic Usage with Default Settings

```bash
atai-pdf-tool path/to/your/document.pdf -o output.json
```

#### Parallel Processing (Faster for Multi-core Systems)

```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --parallel --max-workers 4
```
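
The idea behind `--parallel` and `--max-workers` can be sketched with the standard library. This is only an illustration, not the tool's actual implementation: `process_page` is a hypothetical stand-in for per-page extraction, and a thread pool is assumed for simplicity.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number):
    """Hypothetical stand-in for per-page text extraction or OCR."""
    return f"text of page {page_number}"


def process_pages_parallel(page_numbers, max_workers=4):
    """Process pages concurrently and return the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input iterable,
        # even if individual pages finish out of order.
        return list(pool.map(process_page, page_numbers))


results = process_pages_parallel(range(5, 16), max_workers=4)
print(len(results))  # pages 5 through 15 inclusive
```

Raising the worker count helps only while there are idle cores (or an idle GPU) to use, so a value near the number of CPU cores is a reasonable starting point.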

#### Lower DPI for Faster Processing

```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --dpi 150
```
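
To see why a lower DPI is faster, recall that PDF page sizes are measured in points (1/72 inch), so the number of pixels to process grows with the square of the DPI. The sketch below assumes the page is rasterized at the requested DPI before OCR; `rendered_size` is a hypothetical helper, and 300 DPI is used only as a comparison point, not as the tool's documented default.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points (1/72 inch)


def rendered_size(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is approximately 595 x 842 points.
low = rendered_size(595, 842, 150)
high = rendered_size(595, 842, 300)
ratio = (high[0] * high[1]) / (low[0] * low[1])
print(low, high, round(ratio, 1))
```

Halving the DPI cuts the pixel count to roughly a quarter, which is where most of the speedup comes from. Small or low-contrast text may be recognized less reliably at lower resolutions.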

#### Batch Processing for Large PDFs (Memory-Efficient)

```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --batch --batch-size 10
```
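
Batch processing bounds memory use by handling a fixed number of pages at a time. A minimal sketch of how a page range might be split into batches follows; `batch_ranges` is a hypothetical helper, and it assumes the end page is inclusive, which this README does not state explicitly.

```python
def batch_ranges(start_page, end_page, batch_size):
    """Yield (first, last) inclusive, 0-indexed page ranges.

    Each range covers at most `batch_size` pages; the final range may be
    shorter. Assumes `end_page` is inclusive.
    """
    for first in range(start_page, end_page + 1, batch_size):
        yield first, min(first + batch_size - 1, end_page)


batches = list(batch_ranges(0, 24, 10))
print(batches)  # [(0, 9), (10, 19), (20, 24)]
```

Each batch can then be processed and its results written out before the next batch is loaded, so peak memory depends on the batch size rather than the document length.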

#### OCR-Only Mode with Parallel Processing and GPU Acceleration

```bash
atai-pdf-tool path/to/your/document.pdf -o output.json --ocr-only --parallel --gpu
```

#### Process Specific Page Range with Optimizations

```bash
atai-pdf-tool path/to/your/document.pdf -s 5 -e 15 -o output.json --parallel --dpi 180 --gpu
```
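
When driving the CLI from a Python script, it can be convenient to assemble the argument list programmatically. The sketch below uses only flags documented in this README; `build_command` itself is a hypothetical helper and is not part of the package.

```python
import shlex


def build_command(pdf_path, output=None, start_page=None, end_page=None,
                  parallel=False, dpi=None, gpu=False):
    """Build an argv list for atai-pdf-tool from documented flags."""
    argv = ["atai-pdf-tool", pdf_path]
    if start_page is not None:
        argv += ["-s", str(start_page)]
    if end_page is not None:
        argv += ["-e", str(end_page)]
    if output is not None:
        argv += ["-o", output]
    if parallel:
        argv.append("--parallel")
    if dpi is not None:
        argv += ["--dpi", str(dpi)]
    if gpu:
        argv.append("--gpu")
    return argv


argv = build_command("document.pdf", output="output.json",
                     start_page=5, end_page=15,
                     parallel=True, dpi=180, gpu=True)
# Quote each argument so the printed line is safe to paste into a shell.
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.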

### Options

- `-s`, `--start-page`: Starting page number (0-indexed, default: 0)
- `-e`, `--end-page`: Ending page number (0-indexed, default: last page)
- `-o`, `--output`: Output JSON file path (if not provided, prints to stdout)
- `--ocr-only`: Use OCR for all pages regardless of extractable text
- `-l`, `--lang`: Language for OCR processing (default: en)
- `--parallel`: Enable parallel processing for faster performance (multi-core systems)
- `--max-workers`: Control the number of parallel workers for processing
- `--dpi`: Control image resolution for OCR (lower DPI is faster but may reduce OCR quality)
- `--batch`: Use memory-efficient batch processing for large PDFs
- `--batch-size`: Control the batch size for batch processing
- `--ocr-threshold`: Set the threshold for when to fall back to OCR
- `--gpu`: Enable GPU acceleration for OCR processing
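
How `--ocr-only` and `--ocr-threshold` interact can be illustrated with a small decision function. This is a sketch under a stated assumption: the threshold is treated here as a minimum count of non-whitespace characters in a page's extractable text, and the default of 50 is arbitrary. The tool may define the threshold differently, and `needs_ocr` is a hypothetical name.

```python
def needs_ocr(extracted_text, ocr_threshold=50, ocr_only=False):
    """Decide whether a page should be sent through OCR.

    Assumption: the threshold is a minimum number of non-whitespace
    characters; pages with less extractable text fall back to OCR.
    """
    if ocr_only:
        # OCR-only mode ignores any extractable text.
        return True
    visible_chars = sum(1 for ch in extracted_text if not ch.isspace())
    return visible_chars < ocr_threshold


print(needs_ocr(""))                        # scanned page, no text layer
print(needs_ocr("x" * 100))                 # plenty of extractable text
print(needs_ocr("x" * 100, ocr_only=True))  # forced OCR
```

Under this assumption, a higher threshold sends more pages to OCR, which is slower but safer for documents that mix scanned and digital pages.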

### Supported Languages

The language option (`-l`, `--lang`) accepts language codes supported by EasyOCR. Some common ones include:

- `en`: English
- `ch_sim`: Simplified Chinese
- `ch_tra`: Traditional Chinese
- `fr`: French
- `de`: German
- `ja`: Japanese
- `ko`: Korean
- `es`: Spanish

For a complete list of language codes, see the [EasyOCR documentation](https://www.jaided.ai/easyocr/).

### As a Python Module

```python
from atai_pdf_tool.parser import extract_pdf_pages, ocr_pdf, save_as_json

# Extract text from specific pages with English OCR
text = extract_pdf_pages("document.pdf", start_page=0, end_page=5, lang="en")

# Extract text with different language
chinese_text = extract_pdf_pages("chinese_document.pdf", lang="ch_sim")

# Extract without progress bar
text = extract_pdf_pages("document.pdf", show_progress=False)

# Save to JSON
save_as_json(text, "output.json")

# OCR an entire PDF with a specific language
french_ocr_text = ocr_pdf("french_document.pdf", lang="fr")
```

## Key Improvements and Performance Enhancements

- **Parallel Processing**: Use multiple CPU cores for faster processing of large PDFs.
- **DPI Control**: Adjust the resolution for OCR processing to balance speed and quality (`--dpi`).
- **Batch Processing**: Process large PDFs in memory-efficient batches (`--batch`, `--batch-size`).
- **GPU Acceleration**: Leverage GPU resources for OCR processing (`--gpu`).
- **OCR Threshold**: Set a configurable threshold for when to switch to OCR processing (`--ocr-threshold`).
- **Reused OCR Reader**: Optimized OCR integration for better speed, especially with multi-page documents.

These updates allow you to customize the extraction process based on hardware capabilities, whether you're looking for faster processing or better memory efficiency.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

---

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/AtomGradient/atai-pdf-tool",
    "name": "atai-pdf-tool",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "pdf, ocr, text-extraction, document-processing",
    "author": "AtomGradient",
    "author_email": "AtomGradient <alex@atomgradient.com>",
    "download_url": "https://files.pythonhosted.org/packages/d8/5e/e7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549/atai_pdf_tool-0.1.0.tar.gz",
    "platform": null,
    "description": "# atai-pdf-tool\n\nA command-line tool for parsing and extracting text from PDF files with OCR capabilities and performance optimization options.\n\n## Installation\n\n```bash\npip install atai-pdf-tool\n```\n\n## Usage\n\n### Command Line Interface\n\n#### Basic Usage with Default Settings\n\n```bash\natai-pdf-tool path/to/your/document.pdf -o output.json\n```\n\n#### Parallel Processing (Faster for Multi-core Systems)\n\n```bash\natai-pdf-tool path/to/your/document.pdf -o output.json --parallel --max-workers 4\n```\n\n#### Lower DPI for Faster Processing\n\n```bash\natai-pdf-tool path/to/your/document.pdf -o output.json --dpi 150\n```\n\n#### Batch Processing for Large PDFs (Memory-Efficient)\n\n```bash\natai-pdf-tool path/to/your/document.pdf -o output.json --batch --batch-size 10\n```\n\n#### OCR-Only Mode with Parallel Processing\n\n```bash\natai-pdf-tool path/to/your/document.pdf -o output.json --ocr-only --parallel --gpu\n```\n\n#### Process Specific Page Range with Optimizations\n\n```bash\natai-pdf-tool path/to/your/document.pdf -s 5 -e 15 -o output.json --parallel --dpi 180 --gpu\n```\n\n### Options:\n\n- `-s`, `--start-page`: Starting page number (0-indexed, default: 0)\n- `-e`, `--end-page`: Ending page number (0-indexed, default: last page)\n- `-o`, `--output`: Output JSON file path (if not provided, prints to stdout)\n- `--ocr-only`: Use OCR for all pages regardless of extractable text\n- `-l`, `--lang`: Language for OCR processing (default: en)\n- `--parallel`: Enable parallel processing for faster performance (multi-core systems)\n- `--max-workers`: Control the number of parallel workers for processing\n- `--dpi`: Control image resolution for OCR (lower DPI improves speed)\n- `--batch`: Use memory-efficient batch processing for large PDFs\n- `--batch-size`: Control the batch size for batch processing\n- `--ocr-threshold`: Set the threshold for when to fallback to OCR\n- `--gpu`: Enable GPU acceleration for OCR processing\n\n### 
Supported Languages\n\nThe language option (`-l`, `--lang`) accepts language codes supported by EasyOCR. Some common ones include:\n\n- `en`: English\n- `ch_sim`: Simplified Chinese\n- `ch_tra`: Traditional Chinese\n- `fr`: French\n- `de`: German\n- `jp`: Japanese\n- `ko`: Korean\n- `sp`: Spanish\n\nFor a complete list of language codes, see the [EasyOCR documentation](https://www.jaided.ai/easyocr/).\n\n### As a Python Module\n\n```python\nfrom atai_pdf_tool.parser import extract_pdf_pages, ocr_pdf, save_as_json\n\n# Extract text from specific pages with English OCR\ntext = extract_pdf_pages(\"document.pdf\", start_page=0, end_page=5, lang=\"en\")\n\n# Extract text with different language\nchinese_text = extract_pdf_pages(\"chinese_document.pdf\", lang=\"ch_sim\")\n\n# Extract without progress bar\ntext = extract_pdf_pages(\"document.pdf\", show_progress=False)\n\n# Save to JSON\nsave_as_json(text, \"output.json\")\n\n# OCR an entire PDF with a specific language\nfrench_ocr_text = ocr_pdf(\"french_document.pdf\", lang=\"fr\")\n```\n\n## Key Improvements and Performance Enhancements\n\n- **Parallel Processing**: Use multiple CPU cores for faster processing of large PDFs.\n- **DPI Control**: Adjust the resolution for OCR processing to balance speed and quality (`--dpi`).\n- **Batch Processing**: Process large PDFs in memory-efficient batches (`--batch`, `--batch-size`).\n- **GPU Acceleration**: Leverage GPU resources for OCR processing (`--gpu`).\n- **OCR Threshold**: Set a configurable threshold for when to switch to OCR processing (`--ocr-threshold`).\n- **Reused OCR Reader**: Optimized OCR integration for better speed, especially with multi-page documents.\n\nThese updates allow you to customize the extraction process based on hardware capabilities, whether you're looking for faster processing or better memory efficiency.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n---\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool for parsing and extracting text from PDF files with OCR capabilities",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/AtomGradient/atai-pdf-tool/issues",
        "Homepage": "https://github.com/AtomGradient/atai-pdf-tool"
    },
    "split_keywords": [
        "pdf",
        " ocr",
        " text-extraction",
        " document-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "527a80ea11d4af3dcfb07f98ff53ae607ce8147aee9a1958a9a4b5b607a3c132",
                "md5": "1328f9fc26c146570475b1a382bc6180",
                "sha256": "37f77501052ef93c98bdcd665e23c9030efc57da318908d80800d10f933dc98d"
            },
            "downloads": -1,
            "filename": "atai_pdf_tool-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1328f9fc26c146570475b1a382bc6180",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 8783,
            "upload_time": "2025-02-27T11:15:43",
            "upload_time_iso_8601": "2025-02-27T11:15:43.646208Z",
            "url": "https://files.pythonhosted.org/packages/52/7a/80ea11d4af3dcfb07f98ff53ae607ce8147aee9a1958a9a4b5b607a3c132/atai_pdf_tool-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "d85ee7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549",
                "md5": "5e35022b2db07fa9dd436db13785f72c",
                "sha256": "03ff0b7a7ad60755bb4884b31924b6388e11578deb4b0c8b2f6e577e531f61da"
            },
            "downloads": -1,
            "filename": "atai_pdf_tool-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "5e35022b2db07fa9dd436db13785f72c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 9461,
            "upload_time": "2025-02-27T11:15:46",
            "upload_time_iso_8601": "2025-02-27T11:15:46.378215Z",
            "url": "https://files.pythonhosted.org/packages/d8/5e/e7f18a053dfb4a2a3e9273eb9f5781dbeadc70adefd3412d8b3bd04d9549/atai_pdf_tool-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-27 11:15:46",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "AtomGradient",
    "github_project": "atai-pdf-tool",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "PyPDF2",
            "specs": []
        },
        {
            "name": "PyMuPDF",
            "specs": []
        },
        {
            "name": "easyocr",
            "specs": []
        },
        {
            "name": "tqdm",
            "specs": []
        }
    ],
    "lcname": "atai-pdf-tool"
}
        