# PaddleOCR Python Project
# pdf-searchable-ocr
A simple and powerful Python package for Optical Character Recognition (OCR) with searchable PDF generation using PaddleOCR.
## Features
- 🔍 **High-accuracy OCR** using PaddleOCR
- 📄 **Searchable PDF generation** with invisible text layers
- 🎨 **Bounding box visualization** for OCR results
- 🌍 **Multi-language support** (80+ languages)
- ⚡ **GPU acceleration** support
- 🔧 **Simple class-based API**
- 📦 **Easy installation and usage**
## Installation
### Using pip (recommended)
```bash
pip install pdf-searchable-ocr
```
### Using uv (for development)
```bash
git clone https://github.com/jasminmistry/pdf-searchable-ocr.git
cd pdf-searchable-ocr
uv sync
```
## Quick Start
### Basic Usage
```python
from py_ocr import OCRProcessor
# Initialize the OCR processor
ocr = OCRProcessor(lang='en', verbose=True)
# Process an image
ocr_result = ocr.process_image('path/to/your/image.jpg')
# Create a searchable PDF
pdf_path = ocr.create_searchable_pdf('path/to/your/image.jpg', ocr_result)
# Draw bounding boxes for visualization
boxed_image = ocr.draw_bounding_boxes('path/to/your/image.jpg', ocr_result)
print(f"Searchable PDF created: {pdf_path}")
print(f"Image with bounding boxes: {boxed_image}")
```
### CLI Usage
The package also provides a command-line tool:
```bash
# Basic usage - creates searchable PDF only
pdf-searchable-ocr input.jpg
# Specify custom output PDF name
pdf-searchable-ocr input.jpg --output-pdf my_document.pdf
# Enable bounding box visualization
pdf-searchable-ocr input.jpg --bounding-boxes
# Full options with custom names
pdf-searchable-ocr invoice.jpg \
  --output-pdf invoice_searchable.pdf \
  --output-prefix invoice_processed \
  --bounding-boxes \
  --lang en
```
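When the CLI is driven from a script, it can help to build the argument list programmatically. The sketch below uses only the flags shown above; `build_cli_command` is a hypothetical helper, not part of the package.

```python
import shlex
from typing import List, Optional


def build_cli_command(
    image: str,
    output_pdf: Optional[str] = None,
    output_prefix: Optional[str] = None,
    bounding_boxes: bool = False,
    lang: Optional[str] = None,
) -> List[str]:
    """Build an argument list for the pdf-searchable-ocr CLI (hypothetical helper)."""
    cmd = ["pdf-searchable-ocr", image]
    if output_pdf:
        cmd += ["--output-pdf", output_pdf]
    if output_prefix:
        cmd += ["--output-prefix", output_prefix]
    if bounding_boxes:
        cmd.append("--bounding-boxes")
    if lang:
        cmd += ["--lang", lang]
    return cmd


cmd = build_cli_command(
    "invoice.jpg",
    output_pdf="invoice_searchable.pdf",
    bounding_boxes=True,
    lang="en",
)
# Print a shell-safe version of the command for logging.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package is installed.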
### Complete Workflow
```python
from py_ocr import OCRProcessor
# Initialize processor
ocr = OCRProcessor(lang='en', use_gpu=False, verbose=True)
# Process image with custom PDF name and bounding boxes enabled
results = ocr.process_and_generate_all(
    'invoice.jpg',
    output_pdf='invoice_searchable.pdf',
    output_prefix='invoice_processed',
    bounding_boxes=True
)
if results['searchable_pdf']:
    print(f"✅ Searchable PDF: {results['searchable_pdf']}")
if results['boxed_image']:
    print(f"✅ Visualization: {results['boxed_image']}")
# Or use defaults (no bounding boxes)
results = ocr.process_and_generate_all('document.jpg')
```
### Using Sample Images
```python
from py_ocr import OCRProcessor
# Initialize processor
ocr = OCRProcessor()
# Download a sample image for testing
image_path = ocr.download_sample_image()
# Process with custom settings
results = ocr.process_and_generate_all(
    image_path,
    output_pdf='sample_searchable.pdf',
    output_prefix='sample',
    bounding_boxes=True  # Enable visualization
)
```
## API Reference
### OCRProcessor Class
#### `__init__(lang='en', use_gpu=False, verbose=True, **kwargs)`
Initialize the OCR processor.
**Parameters:**
- `lang` (str): Language for OCR recognition (default: 'en')
- `use_gpu` (bool): Whether to use GPU acceleration (default: False)
- `verbose` (bool): Whether to print verbose output (default: True)
- `**kwargs`: Additional arguments passed to PaddleOCR
#### `process_image(image_path: str) -> Dict[str, Any]`
Perform OCR on an image.
**Parameters:**
- `image_path` (str): Path to the image file
**Returns:**
- `dict`: OCR results containing texts, scores, and bounding boxes
- `None`: If OCR failed
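Because `process_image()` may return `None`, callers usually want to guard against failure before reading the result. The sketch below filters recognized text by confidence. It assumes the result dictionary holds parallel lists under the keys `"texts"` and `"scores"`; those key names are an assumption for illustration, so check the actual result before relying on them.

```python
from typing import Any, Dict, List, Optional, Tuple


def confident_texts(
    ocr_result: Optional[Dict[str, Any]], min_score: float = 0.8
) -> List[Tuple[str, float]]:
    """Return (text, score) pairs at or above min_score; empty if OCR failed."""
    if not ocr_result:
        return []
    # Assumed layout: "texts" and "scores" are parallel lists.
    pairs = zip(ocr_result.get("texts", []), ocr_result.get("scores", []))
    return [(text, score) for text, score in pairs if score >= min_score]


# Hand-written sample data standing in for a real OCR result.
sample = {"texts": ["Invoice", "T0tal", "42.00"], "scores": [0.99, 0.55, 0.93]}
kept = confident_texts(sample)
print(kept)
```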
#### `create_searchable_pdf(image_path: str, ocr_result: dict, output_pdf: str = "searchable_output.pdf") -> str`
Create a searchable PDF with invisible text layers.
**Parameters:**
- `image_path` (str): Path to the source image
- `ocr_result` (dict): OCR results from `process_image()`
- `output_pdf` (str): Output PDF filename (default: "searchable_output.pdf")
**Returns:**
- `str`: Path to the created PDF
- `None`: If creation failed
#### `draw_bounding_boxes(image_path: str, ocr_result: dict, output_image: str = "image_with_boxes.jpg") -> str`
Draw bounding boxes on the image to visualize OCR detection.
**Parameters:**
- `image_path` (str): Path to the source image
- `ocr_result` (dict): OCR results from `process_image()`
- `output_image` (str): Output image filename (default: "image_with_boxes.jpg")
**Returns:**
- `str`: Path to the image with bounding boxes
- `None`: If creation failed
#### `process_and_generate_all(image_path: str, output_pdf: str = "searchable_output.pdf", output_prefix: str = "output", bounding_boxes: bool = False) -> dict`
Complete workflow: OCR + Searchable PDF + Optional Bounding Box Image.
**Parameters:**
- `image_path` (str): Path to the input image
- `output_pdf` (str): Output PDF filename (default: "searchable_output.pdf")
- `output_prefix` (str): Prefix for output files (default: "output")
- `bounding_boxes` (bool): Whether to generate bounding box visualization (default: False)
**Returns:**
- `dict`: Dictionary containing paths to all generated files
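The Complete Workflow example reads the keys `searchable_pdf` and `boxed_image` from the returned dictionary. A minimal sketch for confirming that those files were actually written is shown below; `existing_outputs` is a hypothetical helper, and the demo uses a temporary file in place of a real OCR run.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def existing_outputs(results: Dict[str, Optional[str]]) -> List[Path]:
    """Return the generated files that are present on disk."""
    found = []
    for key in ("searchable_pdf", "boxed_image"):
        path = results.get(key)
        if path and Path(path).is_file():
            found.append(Path(path))
    return found


# Simulate a run that produced a PDF but no visualization.
with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "invoice_searchable.pdf"
    pdf.write_bytes(b"%PDF-1.4\n")
    results = {"searchable_pdf": str(pdf), "boxed_image": None}
    names = [p.name for p in existing_outputs(results)]
print(names)
```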
## Supported Languages
pdf-searchable-ocr supports 80+ languages through PaddleOCR. Some popular ones include:
- `en` - English
- `ch` - Chinese (Simplified)
- `french` - French
- `german` - German
- `korean` - Korean
- `japan` - Japanese
- `it` - Italian
- `es` - Spanish
- `ru` - Russian
- `ar` - Arabic
For the complete list, see [PaddleOCR documentation](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_en/multi_languages_en.md).
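Since some codes are short (`en`, `ru`) and others are full words (`french`, `japan`), a small lookup can reduce typos when the language comes from user input. The sketch below covers only a subset of the codes listed above; `LANG_CODES` and `resolve_lang` are hypothetical names, not part of the package.

```python
# Subset of the language codes listed in this README.
LANG_CODES = {
    "english": "en",
    "chinese": "ch",
    "french": "french",
    "german": "german",
    "korean": "korean",
    "japanese": "japan",
    "italian": "it",
    "russian": "ru",
    "arabic": "ar",
}


def resolve_lang(name: str) -> str:
    """Map a language name or code to the code passed as `lang`."""
    key = name.strip().lower()
    if key in LANG_CODES.values():
        return key  # Already a known code.
    try:
        return LANG_CODES[key]
    except KeyError:
        raise ValueError(f"Unknown language: {name!r}") from None


print(resolve_lang("Japanese"))
print(resolve_lang("ru"))
```

The returned value can then be passed as `OCRProcessor(lang=...)`.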
## Advanced Configuration
### Output Control
```python
from py_ocr import OCRProcessor
# Control output files and features
ocr = OCRProcessor(lang='en')
# Minimal processing - only searchable PDF
results = ocr.process_and_generate_all(
    'document.jpg',
    output_pdf='my_document.pdf',
    bounding_boxes=False  # Skip visualization
)
# Full processing with custom names
results = ocr.process_and_generate_all(
    'invoice.jpg',
    output_pdf='invoice_searchable.pdf',
    output_prefix='invoice_analysis',
    bounding_boxes=True  # Include visualization
)
# Generated files:
# - invoice_searchable.pdf (searchable PDF)
# - invoice_analysis_with_boxes.jpg (visualization)
```
### GPU Acceleration
```python
from py_ocr import OCRProcessor
# Enable GPU acceleration (requires CUDA)
ocr = OCRProcessor(lang='en', use_gpu=True)
```
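Before setting `use_gpu=True`, it can be useful to check whether the installed PaddlePaddle build supports CUDA at all. The sketch below is one way to do that, assuming PaddlePaddle is importable as `paddle`; `cuda_build_available` is a hypothetical helper. A CUDA-enabled build does not by itself prove that a usable GPU is present.

```python
def cuda_build_available() -> bool:
    """Return True if an installed PaddlePaddle build was compiled with CUDA."""
    try:
        import paddle
    except ImportError:
        # PaddlePaddle is not installed, so GPU mode cannot work.
        return False
    return bool(paddle.device.is_compiled_with_cuda())


use_gpu = cuda_build_available()
print(f"use_gpu={use_gpu}")
```

The resulting flag can be passed straight through as `OCRProcessor(lang='en', use_gpu=use_gpu)`.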
### Custom PaddleOCR Settings
```python
from py_ocr import OCRProcessor
# Pass additional PaddleOCR parameters
ocr = OCRProcessor(
    lang='en',
    use_angle_cls=True,               # Enable angle classification
    use_textline_orientation=True,    # Enable text line orientation
    det_model_dir='custom/det/path',  # Custom detection model
    rec_model_dir='custom/rec/path'   # Custom recognition model
)
```
### Batch Processing
```python
from py_ocr import OCRProcessor
import os
ocr = OCRProcessor(lang='en')
# Process multiple images
image_folder = 'path/to/images'
for filename in os.listdir(image_folder):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff')):
        image_path = os.path.join(image_folder, filename)
        base_name = os.path.splitext(filename)[0]
        # Process with custom output names
        results = ocr.process_and_generate_all(
            image_path,
            output_pdf=f"searchable_{base_name}.pdf",
            output_prefix=f"processed_{base_name}",
            bounding_boxes=True  # Generate visualizations
        )
        print(f"Processed: {filename}")
```
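The file selection and output naming in the loop above can be separated from the OCR call, which makes them easy to test without running OCR. The sketch below does this with `pathlib`; `plan_batch` and `IMAGE_SUFFIXES` are hypothetical names, and the demo uses empty placeholder files.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def plan_batch(image_folder: str) -> Iterator[Tuple[Path, str, str]]:
    """Yield (image_path, output_pdf, output_prefix) for each image in a folder."""
    for path in sorted(Path(image_folder).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path, f"searchable_{path.stem}.pdf", f"processed_{path.stem}"


# Demo with placeholder files; only the image files are selected.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.JPG", "notes.txt", "c.png"):
        (Path(folder) / name).touch()
    jobs = [(p.name, pdf, prefix) for p, pdf, prefix in plan_batch(folder)]
print(jobs)
```

Each yielded tuple maps onto the `image_path`, `output_pdf`, and `output_prefix` arguments of `process_and_generate_all()`.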
## Output Examples
### Console Output
```
🔧 Initializing OCR engine with language: en
✅ OCR engine initialized successfully
📁 Using existing sample image: sample_image.jpg
🔍 Processing image: sample_image.jpg
📊 Text blocks detected: 48 | Average confidence: 0.984
✅ Searchable PDF saved as: my_document.pdf
🎨 Drawing 48 bounding boxes on image...
✅ Image with bounding boxes saved as: output_with_boxes.jpg
```
### Generated Files
When using `bounding_boxes=True`:
- `my_document.pdf` - Searchable PDF with invisible text layers
- `output_with_boxes.jpg` - Original image with colored bounding boxes
When using `bounding_boxes=False` (default):
- `my_document.pdf` - Searchable PDF only (faster processing)
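The file names above follow the pattern shown elsewhere in this README: the PDF keeps the `output_pdf` name, and the visualization is named `<output_prefix>_with_boxes.jpg`. A minimal sketch that predicts the output names from those inputs follows; `expected_outputs` is a hypothetical helper, and the naming pattern is inferred from the examples rather than confirmed against the package source.

```python
from typing import List


def expected_outputs(
    output_pdf: str = "searchable_output.pdf",
    output_prefix: str = "output",
    bounding_boxes: bool = False,
) -> List[str]:
    """Predict output file names (pattern inferred from the README examples)."""
    files = [output_pdf]
    if bounding_boxes:
        files.append(f"{output_prefix}_with_boxes.jpg")
    return files


with_boxes = expected_outputs("my_document.pdf", bounding_boxes=True)
pdf_only = expected_outputs("my_document.pdf")
print(with_boxes)
print(pdf_only)
```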
## Requirements
- Python >= 3.8
- PaddleOCR >= 2.7.0
- OpenCV >= 4.0
- ReportLab >= 4.0
- Pillow >= 8.0
## Installation from Source
```bash
# Clone the repository
git clone https://github.com/jasminmistry/pdf-searchable-ocr.git
cd pdf-searchable-ocr
# Install with uv (recommended for development)
uv sync
# Or install with pip
pip install -e .
```
## Development
### Running Tests
```bash
uv run python -m pytest tests/
```
### Code Formatting
```bash
uv run black py_ocr/
uv run isort py_ocr/
```
### Type Checking
```bash
uv run mypy py_ocr/
```
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for the excellent OCR engine
- [ReportLab](https://www.reportlab.com/) for PDF generation capabilities
- [OpenCV](https://opencv.org/) for image processing
## Changelog
### v0.1.0
- Initial release
- Basic OCR functionality
- Searchable PDF generation
- Bounding box visualization
- Multi-language support
## Support
If you encounter any issues or have questions:
1. Check the [Issues](../../issues) page
2. Create a new issue with detailed information
3. Contact the maintainers
## Roadmap
- [ ] Web interface for easy usage
- [ ] Batch processing CLI tool
- [ ] Docker container
- [ ] Additional output formats (Excel, Word)
- [ ] OCR result caching
- [ ] Performance optimizations