# OCR Detection Library
A Python library to analyze PDF pages and determine whether they contain extractable text or are scanned images requiring OCR processing.
**NEW in v0.3.0**: Smart Image Extraction makes scanned-PDF processing about **5x faster** and cuts memory usage by about **33%**. The default fast mode is about **40x faster** than accuracy mode, and parallel processing is optimized for large documents.
## Features
- **Page Type Detection**: Automatically classifies PDF pages as text, scanned, mixed, or empty
- **Smart Image Extraction**: 5x faster image processing for scanned PDFs using embedded images
- **Base64 Image Output**: Get page images as base64-encoded strings for visualization
- **Dual Processing Modes**: Fast mode (about 40x faster than accuracy mode) for speed, accuracy mode for precision
- **Parallel Processing**: Fast analysis of large PDFs using multi-threading (up to 8x speedup)
- **Confidence Scoring**: Reliability indicators for classifications
- **Memory Efficient**: 33% reduction in memory usage with optimized image handling
- **Simple API**: Easy-to-use interface with minimal complexity
## Installation
```bash
# Clone or download the project
cd ocr-detection
# Install with uv (recommended)
uv sync
# Or install with pip
pip install ocr-detection
```
## Usage
### Quick Start
```python
from ocr_detection import detect_ocr
# RECOMMENDED: Serverless mode with images - optimal for most use cases
# (12-17s for 1000+ pages, includes optimized images for OCR processing)
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)
# RECOMMENDED: Serverless mode for classification only - ultra-fast
# (sub-2 seconds for 1000+ pages, no images)
result = detect_ocr("document.pdf", serverless_mode=True)
# Traditional fast mode - 40x faster than accuracy mode
result = detect_ocr("document.pdf")
# Accuracy mode - slowest but most precise
result = detect_ocr("document.pdf", accuracy_mode=True)
print(result)
# Output: {"status": "partial", "pages": [1, 3, 7, 12]}
# Check the status
if result['status'] == "true":
    print("All pages need OCR")
elif result['status'] == "false":
    print("No pages need OCR")
else:  # partial
    print(f"Pages needing OCR: {result['pages']}")
```
### Recommended Usage (Serverless Optimized)
**For Google Cloud Functions/Run and other serverless environments:**
```python
from ocr_detection import detect_ocr, OCRDetection
# Option 1: Quick function call with images (RECOMMENDED)
# Perfect balance of speed and functionality
result = detect_ocr("document.pdf", serverless_mode=True, include_images=True)
# Performance: 12-17s for 1000+ pages with optimized images
# Option 2: Classification only (ultra-fast)
# When you only need to know which pages need OCR
result = detect_ocr("document.pdf", serverless_mode=True)
# Performance: sub-2 seconds for 1000+ pages
# Option 3: Class-based approach
detector = OCRDetection(serverless_mode=True, include_images=True)
result = detector.detect("document.pdf")
```
### Using the OCRDetection Class
```python
from ocr_detection import OCRDetection
# RECOMMENDED: Serverless mode - optimal for most use cases
# Automatically enables metadata_only=True, optimized images, and conservative parallelization
serverless_detector = OCRDetection(serverless_mode=True)
# RECOMMENDED: Serverless mode with images for OCR processing
# (12-17s for 1000+ pages with optimized ultra-fast image generation)
serverless_with_images = OCRDetection(serverless_mode=True, include_images=True)
# Traditional fast mode - 40x faster than accuracy mode
detector = OCRDetection(
    accuracy_mode=False,       # Fast mode (default)
    confidence_threshold=0.5,  # Minimum confidence for OCR detection
    parallel=True,             # Enable parallel processing
    include_images=False,      # No images by default
    image_format="png",        # Image format: "png" or "jpeg"
    image_dpi=150              # Image resolution (DPI)
)
# Accuracy mode - slowest but most precise
accurate_detector = OCRDetection(accuracy_mode=True)
# Analyze a document
result = detector.detect("document.pdf")
# With custom parallel settings for large documents
result = detector.detect("large_document.pdf", parallel=True, max_workers=8)
```
### Understanding Results
The library returns a dictionary with the following fields:
- **status**: Indicates the OCR requirement (a string, not a boolean)
  - `"true"` - All pages need OCR processing
  - `"false"` - No pages need OCR processing
  - `"partial"` - Some pages need OCR processing
- **pages**: List of page numbers (1-indexed) that need OCR processing
  - Empty list when status is `"false"`
  - Contains all page numbers when status is `"true"`
  - Contains specific page numbers when status is `"partial"`
- **page_images**: Dictionary mapping page numbers to base64-encoded images (when `include_images=True`)
  - Only included for pages that need OCR processing
  - Page numbers are 1-indexed to match PDF page numbering
  - Images are base64-encoded PNG or JPEG strings
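
The field rules above can be turned into a small consistency check. This is a minimal sketch, not part of the library: `check_result` is a hypothetical helper, and `total_pages` is assumed to be known from elsewhere, since the documented result does not include a page count. It also assumes `page_images` keys are integers, as in the examples in this README.

```python
def check_result(result: dict, total_pages: int) -> list[str]:
    """Return a list of contract violations (empty if the result looks consistent)."""
    problems = []
    status = result.get("status")
    pages = result.get("pages", [])

    # status is documented as one of three strings, not a boolean
    if status not in {"true", "false", "partial"}:
        problems.append(f"unexpected status: {status!r}")
    if any(p < 1 or p > total_pages for p in pages):
        problems.append("page numbers must be 1-indexed and within the document")
    if status == "false" and pages:
        problems.append('status "false" should come with an empty pages list')
    if status == "true" and sorted(pages) != list(range(1, total_pages + 1)):
        problems.append('status "true" should list every page')
    if status == "partial" and not 0 < len(pages) < total_pages:
        problems.append('status "partial" should list some, but not all, pages')

    # page_images is only present when include_images=True
    extra = set(result.get("page_images", {})) - set(pages)
    if extra:
        problems.append(f"images for pages not flagged for OCR: {sorted(extra)}")
    return problems
```

A check like this is mostly useful in tests or at a service boundary, where a boolean `True` slipping in for the string `"true"` would otherwise go unnoticed.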
### Examples
```python
from ocr_detection import detect_ocr
# Example 1: Fully text-based PDF
result = detect_ocr("text_document.pdf")
# {"status": "false", "pages": []}
# Example 2: Scanned PDF
result = detect_ocr("scanned_document.pdf")
# {"status": "true", "pages": [1, 2, 3, 4, 5]}
# Example 3: Mixed content PDF
result = detect_ocr("mixed_document.pdf")
# {"status": "partial", "pages": [2, 5, 8]}
# Example 4: With base64 images
result = detect_ocr("document.pdf", include_images=True)
# {
#     "status": "partial",
#     "pages": [2, 5],
#     "page_images": {
#         2: "iVBORw0KGgoAAAANSUhEUgAA...",  # base64 PNG data
#         5: "iVBORw0KGgoAAAANSUhEUgAA..."   # base64 PNG data
#     }
# }
# Example 5: Custom image settings
result = detect_ocr(
    "document.pdf",
    include_images=True,
    image_format="jpeg",  # Use JPEG instead of PNG
    image_dpi=200         # Higher resolution
)
# Example 6: With parallel processing for large PDFs
result = detect_ocr("large_document.pdf", parallel=True, max_workers=8)
# Example 7: Accuracy vs Speed modes
fast_result = detect_ocr("document.pdf") # Fast mode (default)
accurate_result = detect_ocr("document.pdf", accuracy_mode=True) # Accuracy mode
# Example 8: Serverless optimization (RECOMMENDED)
serverless_result = detect_ocr("document.pdf", serverless_mode=True, include_images=True) # Optimal balance
# Example 9: Ultra-fast classification only
classify_result = detect_ocr("document.pdf", serverless_mode=True) # Sub-2 seconds for 1000+ pages
```
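
Once a result is in hand, a common next step is to route pages: flagged pages go to an OCR engine, the rest to ordinary text extraction. The sketch below shows one way to do that with plain Python. `split_pages` and `to_ranges` are hypothetical helpers, and `total_pages` is an assumption supplied by the caller, because the result dictionary only lists the pages that need OCR.

```python
def split_pages(result: dict, total_pages: int) -> tuple[list[int], list[int]]:
    """Return (pages needing OCR, pages with extractable text), both 1-indexed."""
    ocr_pages = sorted(set(result["pages"]))
    flagged = set(ocr_pages)
    text_pages = [p for p in range(1, total_pages + 1) if p not in flagged]
    return ocr_pages, text_pages


def to_ranges(pages: list[int]) -> list[tuple[int, int]]:
    """Collapse sorted page numbers into inclusive (start, end) runs."""
    ranges: list[tuple[int, int]] = []
    for page in pages:
        if ranges and page == ranges[-1][1] + 1:
            # Extend the current run
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Start a new run
            ranges.append((page, page))
    return ranges


# Using the shape of Example 3 above, for an 8-page document
ocr_pages, text_pages = split_pages({"status": "partial", "pages": [2, 5, 8]}, 8)
print(ocr_pages)              # [2, 5, 8]
print(to_ranges(text_pages))  # [(1, 1), (3, 4), (6, 7)]
```

Grouping into ranges is a design convenience: many PDF tools accept page ranges, so consecutive pages can be handled in one call.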
## Image Output Options
The library can generate base64-encoded images of pages that need OCR processing:
### Parameters
- **include_images**: `bool` - Enable base64 image output (default: `False`)
- **image_format**: `str` - Output format: `"png"` or `"jpeg"` (default: `"png"`)
- **image_dpi**: `int` - Resolution in DPI (default: `150`)
### Usage Notes
- Images are only generated for pages that need OCR processing
- **Smart extraction**: Scanned pages use embedded images for 5x faster processing
- Higher DPI values produce larger but clearer images (only affects rendered pages)
- PNG format preserves quality but has larger file sizes
- JPEG format is more compact but may have compression artifacts
- Page numbers in `page_images` match those in the `pages` list (1-indexed)
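
Because scanned pages may come back as embedded JPEG data while rendered pages follow `image_format`, it can help to inspect the decoded bytes before choosing a file extension. The following is a minimal sketch using only the standard library. `sniff_format` and `save_page_images` are hypothetical helpers, and the file naming scheme is an assumption. The magic numbers are the standard PNG and JPEG file signatures.

```python
import base64
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker


def sniff_format(data: bytes) -> str:
    """Guess a file extension from the first bytes of decoded image data."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpg"
    return "bin"


def save_page_images(page_images: dict, out_dir: str) -> list[Path]:
    """Decode each base64 string and write it as page_NNNN.<ext> in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for page, encoded in sorted(page_images.items()):
        data = base64.b64decode(encoded)
        path = out / f"page_{int(page):04d}.{sniff_format(data)}"
        path.write_bytes(data)
        written.append(path)
    return written
```

With a result from `detect_ocr(..., include_images=True)`, the call would be `save_page_images(result["page_images"], "ocr_input")`.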
## Performance
### Version 0.3.0 Optimization
The library now features **Smart Image Extraction** for dramatically improved performance:
- **5x faster** processing for scanned PDFs (2.5s → 0.54s)
- **33% memory reduction** (116MB → 79MB)
- **8x smaller** image data (15.9MB → 2.0MB)
- **20x faster** per-image processing (1.2s → 0.06s per image)
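
The headline factors are rounded from the before/after measurements in parentheses. As a quick check of the arithmetic, the short script below recomputes each ratio from the quoted numbers. The `improvement` helper is illustrative only and the measurements are copied from the list above.

```python
# Before/after pairs quoted in this README
measurements = {
    "scanned PDF processing (s)": (2.5, 0.54),
    "memory (MB)": (116, 79),
    "image data (MB)": (15.9, 2.0),
    "per-image processing (s)": (1.2, 0.06),
}


def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (improvement factor, percent reduction) for a before/after pair."""
    return before / after, 100 * (before - after) / before


for name, (before, after) in measurements.items():
    factor, reduction = improvement(before, after)
    print(f"{name}: {factor:.1f}x, {reduction:.0f}% lower")
```

The ratios come out at roughly 4.6x, 8x and 20x for the three speed and size figures, and the memory pair works out to a reduction of about 32%, close to the quoted 33%.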
### How It Works
- **Scanned PDFs**: Extracts original embedded JPEG images directly (no re-rendering)
- **Text PDFs**: Uses traditional rendering for vector content
- **Quality Preservation**: Maintains original image compression and quality
- **Thread Safety**: Works seamlessly with parallel processing
### Processing Modes
**Fast Mode (Default)**:
- **40x faster** than accuracy mode
- Uses optimized text extraction (PyMuPDF only)
- Fast page classification heuristics
- Recommended for most use cases
**Accuracy Mode**:
- Maximum precision using dual text extraction
- Comprehensive text quality analysis
- Better for documents requiring high confidence
- Use when precision is more important than speed
### Automatic Optimization
The library automatically optimizes performance based on document size and content:
- Documents with ≤10 pages use sequential processing
- Larger documents automatically use parallel processing
- **Current parallel limit**: 8 workers (configurable)
- **Parallel speedup**: 3-8x performance improvement for large documents
- **Worker optimization**: `min(cpu_count, total_pages, max_workers)`
- Smart image extraction eliminates unnecessary rendering overhead
### Performance Tuning
```python
from ocr_detection import detect_ocr
# For maximum speed on large documents
result = detect_ocr(
    "large_document.pdf",
    accuracy_mode=False,  # Fast mode
    parallel=True,        # Enable parallel processing
    max_workers=8         # Use up to 8 workers
)
# For maximum accuracy
result = detect_ocr(
    "document.pdf",
    accuracy_mode=True  # Accuracy mode (slower)
)
# Custom worker count for high-core systems
result = detect_ocr(
    "huge_document.pdf",
    parallel=True,
    max_workers=16  # Increase for powerful hardware
)
```
### Benchmark Results
**Large Document Test** (1045 pages, 3.9MB):
- **Fast mode**: ~8.0s
- **Fast mode + images**: ~33.7s
- **Parallel processing**: 3-8x faster than sequential
- **Memory usage**: Optimized with 33% reduction
**Performance Guidelines**:
- Use **fast mode** for general document analysis
- Use **accuracy mode** when precision is critical
- **Parallel processing** is enabled automatically for documents with more than 10 pages
- Increase `max_workers` on high-core systems for better performance
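
Benchmark figures depend heavily on hardware and on the document, so it is worth measuring on your own files before choosing a mode. Below is a minimal timing sketch using only the standard library. `time_call` is a hypothetical helper that accepts any callable, so it can wrap `detect_ocr` with whichever arguments you want to compare.

```python
import statistics
import time


def time_call(fn, *args, repeats: int = 3, **kwargs) -> dict:
    """Run fn(*args, **kwargs) `repeats` times and return timings in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {"best": min(timings), "median": statistics.median(timings)}


# Example usage (requires ocr-detection and a local PDF):
# from ocr_detection import detect_ocr
# print(time_call(detect_ocr, "document.pdf"))
# print(time_call(detect_ocr, "document.pdf", accuracy_mode=True))
```

Reporting both the best and the median run helps separate warm-up effects, such as file caching, from steady-state performance.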
## License
MIT License - see LICENSE file for details
Raw data
{
"_id": null,
"home_page": null,
"name": "ocr-detection",
"maintainer": "satish",
"docs_url": null,
"requires_python": ">=3.13",
"maintainer_email": "satish <satish860@gmail.com>",
"keywords": "pdf, ocr, text-extraction, document-processing, pdf-analysis",
"author": "satish",
"author_email": "satish <satish860@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/8c/38/b76697fb20ef8d1ec78c5021e9fe47de10cc0866a1d04a15a97577d94eeb/ocr_detection-0.4.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python library to detect whether PDF pages contain extractable text or are scanned images requiring OCR",
"version": "0.4.1",
"project_urls": {
"Documentation": "https://github.com/satish860/ocr-detection#readme",
"Homepage": "https://github.com/satish860/ocr-detection",
"Issues": "https://github.com/satish860/ocr-detection/issues",
"Repository": "https://github.com/satish860/ocr-detection"
},
"split_keywords": [
"pdf",
" ocr",
" text-extraction",
" document-processing",
" pdf-analysis"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "9ef44494f058f4ecc13c7723ea34cd1c0538e243f7e8395d3726f79465ddb16e",
"md5": "610de67f6ec16c8ebc8de713ce194106",
"sha256": "d46d8c216c5438c52c83351924f2f2bb70ff0520c7c28dee32d36760eafa6548"
},
"downloads": -1,
"filename": "ocr_detection-0.4.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "610de67f6ec16c8ebc8de713ce194106",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.13",
"size": 22212,
"upload_time": "2025-08-22T07:27:09",
"upload_time_iso_8601": "2025-08-22T07:27:09.374326Z",
"url": "https://files.pythonhosted.org/packages/9e/f4/4494f058f4ecc13c7723ea34cd1c0538e243f7e8395d3726f79465ddb16e/ocr_detection-0.4.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8c38b76697fb20ef8d1ec78c5021e9fe47de10cc0866a1d04a15a97577d94eeb",
"md5": "a54d3154ef87b5d14b546e282b3c848b",
"sha256": "4dd5948dde9db10160b947d8e5ce3007d227038ca87c900d57a48978b4ac29bb"
},
"downloads": -1,
"filename": "ocr_detection-0.4.1.tar.gz",
"has_sig": false,
"md5_digest": "a54d3154ef87b5d14b546e282b3c848b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.13",
"size": 20560,
"upload_time": "2025-08-22T07:27:10",
"upload_time_iso_8601": "2025-08-22T07:27:10.681473Z",
"url": "https://files.pythonhosted.org/packages/8c/38/b76697fb20ef8d1ec78c5021e9fe47de10cc0866a1d04a15a97577d94eeb/ocr_detection-0.4.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-22 07:27:10",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "satish860",
"github_project": "ocr-detection#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "ocr-detection"
}