ocr-detection

- Name: ocr-detection
- Version: 0.1.2
- Summary: A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR
- Upload time: 2025-08-13 04:29:13
- Author and maintainer: satish
- Requires Python: >=3.13
- License: MIT
- Keywords: pdf, ocr, text-extraction, document-processing, pdf-analysis

# OCR Detection Library

A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.

## Features

- **Page Type Detection**: Automatically classifies PDF pages as text, scanned, mixed, or empty
- **Parallel Processing**: Fast analysis of large PDFs using multi-threading
- **Confidence Scoring**: Reliability indicators for classifications
- **Simple API**: Easy-to-use interface with minimal complexity

## Installation

```bash
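# Install the released package from PyPI (requires Python >= 3.13)
pip install ocr-detection

# Or work from a source checkout
git clone https://github.com/satish860/ocr-detection
cd ocr-detection

# Install with uv (recommended)
uv sync

# Or install in editable mode with pip
pip install -e .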
```

## Usage

### Quick Start

```python
from ocr_detection import detect_ocr

# Analyze a PDF document
result = detect_ocr("document.pdf")

print(result)
# Example output: {'status': 'partial', 'pages': [1, 3, 7, 12]}

# Check the status
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")
```
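
Because `status` is a string rather than a boolean, a small wrapper can turn the result into a yes/no decision. The sketch below is only an illustration: `needs_ocr` is a hypothetical helper, not part of the library, and it assumes the result has the shape printed above.

```python
from typing import Optional


def needs_ocr(result: dict, page: Optional[int] = None) -> bool:
    """Return True if the document (or one 1-indexed page) needs OCR.

    Hypothetical helper; assumes `result` looks like
    {"status": "partial", "pages": [1, 3, 7, 12]}.
    """
    status = result["status"]
    if status not in {"true", "false", "partial"}:
        raise ValueError(f"unexpected status: {status!r}")
    if page is None:
        # Any status other than "false" means at least one page needs OCR.
        return status != "false"
    return page in result["pages"]


sample = {"status": "partial", "pages": [1, 3, 7, 12]}
print(needs_ocr(sample))          # True
print(needs_ocr(sample, page=2))  # False
```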

### Using the OCRDetection Class

```python
from ocr_detection import OCRDetection

# Initialize detector with options
detector = OCRDetection(
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True              # Enable parallel processing
)

# Analyze a document
result = detector.detect("document.pdf")

# With custom parallel settings
result = detector.detect("large_document.pdf", max_workers=4)
```
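
When several files need checking, one detector can be reused across them. The following is a minimal sketch, assuming the detector is passed in as a plain callable; `group_by_status` and `fake_detect` are hypothetical names introduced here so the example runs without any PDF files.

```python
from collections.abc import Callable, Iterable


def group_by_status(
    paths: Iterable[str],
    detect: Callable[[str], dict],
) -> dict[str, list[str]]:
    """Group PDF paths by the `status` string that `detect` reports."""
    groups: dict[str, list[str]] = {"true": [], "false": [], "partial": []}
    for path in paths:
        result = detect(path)
        groups.setdefault(result["status"], []).append(path)
    return groups


# Stand-in detector (an assumption, not the library's logic).
def fake_detect(path: str) -> dict:
    if "scan" in path:
        return {"status": "true", "pages": [1, 2]}
    return {"status": "false", "pages": []}


print(group_by_status(["scan_a.pdf", "report.pdf"], fake_detect))
# {'true': ['scan_a.pdf'], 'false': ['report.pdf'], 'partial': []}
```

With the library installed, `detector.detect` from the example above could be passed in place of `fake_detect`.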

### Understanding Results

The library returns a dictionary with two fields:

- **status**: A string (not a boolean) indicating how much of the document needs OCR
  - `"true"` - All pages need OCR processing
  - `"false"` - No pages need OCR processing
  - `"partial"` - Some pages need OCR processing

- **pages**: List of page numbers (1-indexed) that need OCR processing
  - Empty list when status is `"false"`
  - Contains all page numbers when status is `"true"`
  - Contains specific page numbers when status is `"partial"`

### Examples

```python
from ocr_detection import detect_ocr

# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}

# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}

# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}

# Example 4: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True)
```
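
If a downstream OCR step expects page ranges rather than a list, the `pages` field can be collapsed first. A minimal sketch, assuming a comma-separated range syntax such as `1-3,7`; `to_page_ranges` is a hypothetical helper and the target format is an assumption, so check what your OCR tool actually accepts.

```python
def to_page_ranges(pages: list[int]) -> str:
    """Collapse 1-indexed page numbers into a string like "1-3,7"."""
    ranges = []
    start = prev = None
    for page in sorted(set(pages)):
        if prev is not None and page == prev + 1:
            # Still inside a run of consecutive pages.
            prev = page
            continue
        if start is not None:
            ranges.append((start, prev))
        start = prev = page
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(to_page_ranges([1, 2, 3, 4, 5]))  # 1-5
print(to_page_ranges([2, 5, 8]))        # 2,5,8
```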

## Performance

The library chooses a processing strategy based on document size:
- Documents with 10 or fewer pages are processed sequentially
- Larger documents are processed in parallel with a configurable number of worker threads (`max_workers`)
- Parallel processing runs 3-8x faster than sequential processing on large documents
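
The size-based switch described in this list can be sketched with the standard library. This is an illustration under stated assumptions, not the library's implementation: `analyze_pages` and the per-page callback are hypothetical, and only the 10-page limit comes from the list above.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor

SEQUENTIAL_LIMIT = 10  # pages; taken from the Performance list


def analyze_pages(
    page_count: int,
    analyze_page: Callable[[int], bool],
    max_workers: int = 4,
) -> list[int]:
    """Return the 1-indexed pages for which `analyze_page` returns True."""
    page_numbers = range(1, page_count + 1)
    if page_count <= SEQUENTIAL_LIMIT:
        flags = [analyze_page(n) for n in page_numbers]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() preserves input order, so flags line up with page_numbers.
            flags = list(pool.map(analyze_page, page_numbers))
    return [n for n, flag in zip(page_numbers, flags) if flag]


# Stand-in per-page check: pretend every third page is a scan.
print(analyze_pages(12, lambda n: n % 3 == 0))  # [3, 6, 9, 12]
```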

## License

MIT License - see LICENSE file for details
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "ocr-detection",
    "maintainer": "satish",
    "docs_url": null,
    "requires_python": ">=3.13",
    "maintainer_email": "satish <satish860@gmail.com>",
    "keywords": "pdf, ocr, text-extraction, document-processing, pdf-analysis",
    "author": "satish",
    "author_email": "satish <satish860@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/f9/a5/ea931fc22f29b134760408ca21ab5502de10175fa6e4ed4d275beade3aa8/ocr_detection-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR",
    "version": "0.1.2",
    "project_urls": {
        "Documentation": "https://github.com/satish860/ocr-detection#readme",
        "Homepage": "https://github.com/satish860/ocr-detection",
        "Issues": "https://github.com/satish860/ocr-detection/issues",
        "Repository": "https://github.com/satish860/ocr-detection"
    },
    "split_keywords": [
        "pdf",
        " ocr",
        " text-extraction",
        " document-processing",
        " pdf-analysis"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "778bf6a35265a62472f8e9f8a6fef7c9f2269c5ec70073086239c730e1285db4",
                "md5": "024cb6b60e6cb5d723e482f5863a175f",
                "sha256": "ef5791ee85c7668e85d305b181e2956c2e4925cd990e1aea452738380db87705"
            },
            "downloads": -1,
            "filename": "ocr_detection-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "024cb6b60e6cb5d723e482f5863a175f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.13",
            "size": 14523,
            "upload_time": "2025-08-13T04:29:12",
            "upload_time_iso_8601": "2025-08-13T04:29:12.425388Z",
            "url": "https://files.pythonhosted.org/packages/77/8b/f6a35265a62472f8e9f8a6fef7c9f2269c5ec70073086239c730e1285db4/ocr_detection-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f9a5ea931fc22f29b134760408ca21ab5502de10175fa6e4ed4d275beade3aa8",
                "md5": "2208865d9d61b0ecc46cf1ddd6a02817",
                "sha256": "017c632acb4b02073b243f30b62f4d77e24bf21ffe3d0c86f4abeaa90e7201d1"
            },
            "downloads": -1,
            "filename": "ocr_detection-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "2208865d9d61b0ecc46cf1ddd6a02817",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.13",
            "size": 13005,
            "upload_time": "2025-08-13T04:29:13",
            "upload_time_iso_8601": "2025-08-13T04:29:13.714901Z",
            "url": "https://files.pythonhosted.org/packages/f9/a5/ea931fc22f29b134760408ca21ab5502de10175fa6e4ed4d275beade3aa8/ocr_detection-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-13 04:29:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "satish860",
    "github_project": "ocr-detection#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "ocr-detection"
}