# DocForge 🔨
**Forge perfect documents from any format with precision, power, and simplicity.**
DocForge is a document processing toolkit with a modular architecture. It covers OCR, PDF optimization, PDF splitting, and PDF-to-Word conversion, and it can be used from the command line or through a Python API.
## ✨ Features
- 🔍 **OCR Processing**: Convert scanned PDFs to searchable documents with precision
- 🗜️ **Smart Optimization**: Reduce file sizes without compromising quality
- ⚙️ **Batch Processing**: Handle hundreds of documents efficiently
- 🔧 **Document Analysis**: Extract insights and metadata
- 🎯 **Modular Design**: Use only what you need, extend easily
## 🚀 Why DocForge?
- **Battle-tested OCR algorithms** with Windows compatibility
- **Advanced optimization techniques** from real-world usage
- **Memory-efficient batch processing** for large-scale operations
- **Clean, modular codebase** that's easy to understand and extend
- **Comprehensive error handling** and logging
- **Both programmatic API and command-line interface**
## 📦 Installation
### Option 1: Install from PyPI
```bash
pip install docforge
```
### Option 2: Install from source
```bash
git clone https://github.com/oscar2song/docforge.git
cd docforge
pip install -e .
```
### System Dependencies
**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr poppler-utils
```
**macOS:**
```bash
brew install tesseract poppler
```
**Windows:**
Download Tesseract from: https://github.com/tesseract-ocr/tesseract
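Before running OCR, it can help to confirm that the system tools are actually reachable. The following is a minimal sketch, not part of DocForge: it assumes the relevant executables are `tesseract` and `pdftoppm` (one of the Poppler command-line tools), and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumption: these are the executables that must be on PATH.
# `tesseract` is the OCR engine; `pdftoppm` ships with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.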
## 🎯 Quick Start
### Command Line Interface
After installation, use the `docforge` command:
```bash
# Get help
docforge --help
# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf
# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/
# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng
# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/
# Test the interface
docforge test-rich
# Run performance benchmarks
docforge benchmark --test-files document.pdf
```
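The CLI can also be driven from a script. The sketch below builds the same `ocr` and `enhanced-ocr` invocations shown above and runs them with `subprocess`. The helper names `build_ocr_command` and `run_ocr` are hypothetical, and it is an assumption that `docforge` signals failure through a non-zero exit code.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="eng", enhanced=False):
    """Build the argument list for a single-file OCR run."""
    subcommand = "enhanced-ocr" if enhanced else "ocr"
    command = ["docforge", subcommand, "-i", str(input_pdf), "-o", str(output_pdf)]
    if not enhanced:
        # --language is only shown with the `ocr` subcommand in the examples above,
        # so it is not passed to `enhanced-ocr` here.
        command += ["--language", language]
    return command


def run_ocr(input_pdf, output_pdf, language="eng", enhanced=False):
    """Run the command and return its exit code."""
    if shutil.which("docforge") is None:
        raise FileNotFoundError("the docforge command is not on PATH")
    command = build_ocr_command(input_pdf, output_pdf, language, enhanced)
    # Assumption: a non-zero return code means the run failed.
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.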
### Programmatic API
```python
from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language='eng'
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive"
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/"
)
```
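If you need per-file control instead of `batch_ocr_pdfs`, one option is to plan the input/output pairs yourself and call a single-file method on each. This is a rough sketch under stated assumptions: `plan_batch` and `run_batch` are hypothetical helpers, and it is assumed (not documented above) that `ocr_pdf` raises an exception on failure rather than reporting it only through its return value.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Yield (source, destination) pairs for every PDF directly in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    for source in sorted(input_dir.glob("*.pdf")):
        yield source, output_dir / source.name


def run_batch(process_one, pairs):
    """Call process_one(source, destination) for each pair and collect failures."""
    failures = []
    for source, destination in pairs:
        destination.parent.mkdir(parents=True, exist_ok=True)
        try:
            process_one(str(source), str(destination))
        except Exception as error:  # keep going so one bad file does not stop the run
            failures.append((source, error))
    return failures


# Usage with DocForge installed (not executed here):
#   from docforge import DocumentProcessor
#   processor = DocumentProcessor(verbose=True)
#   failures = run_batch(
#       lambda src, dst: processor.ocr_pdf(src, dst, language="eng"),
#       plan_batch("scanned_folder/", "searchable_folder/"),
#   )
```

Taking the processing step as a callable keeps the helpers independent of DocForge, so the same loop could drive `optimize_pdf` or be tested with a stub.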
## 🏗️ Architecture
DocForge is built with a clean, modular architecture:
```
docforge/
├── core/          # Core processing engine
├── pdf/           # PDF operations (proven implementations)
├── cli/           # Command-line interface
├── utils/         # Shared utilities
└── main.py        # CLI entry point
```
## 📋 Available Commands
| Command | Description |
|---------|-------------|
| `enhanced-ocr` | OCR with advanced performance optimization |
| `enhanced-batch-ocr` | Batch OCR with intelligent performance optimization |
| `ocr` | Standard OCR processing |
| `batch-ocr` | Standard batch OCR processing |
| `optimize` | PDF optimization |
| `pdf-to-word` | PDF to Word conversion |
| `split-pdf` | Split PDF documents |
| `benchmark` | Run performance benchmarks |
| `perf-stats` | Display performance statistics |
| `test-rich` | Test Rich CLI interface |
## 🧪 Examples
Run the examples to see DocForge in action:
```bash
# Basic usage examples (if you have example files)
python examples/basic_usage.py
# Test the CLI interface
docforge test-rich
# Test error handling
docforge test-errors
# Test validation system
docforge test-validation
```
## 🤝 Contributing
We welcome contributions! The modular architecture makes it easy to add new features.
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 🗺️ Roadmap
- ✅ Core PDF processing with proven implementations
- ✅ OCR and optimization capabilities
- ✅ Command-line interface
- ✅ Comprehensive documentation
- 📄 Word document processing (Word ↔ PDF conversion)
- 🎨 Modern GUI interface
- 🚀 Performance optimizations
- 📊 Excel and PowerPoint support
- 🤖 AI-powered document analysis
- 🌐 Web interface
## 📄 License
This project is licensed under the MIT License.
## 🏆 Acknowledgments
Built with proven implementations and enhanced with modern architecture for the open source community.
---
⭐ **If DocForge helped you, please give it a star!** ⭐
*Built by craftsmen, for craftsmen.* 🔨