docforge

Name	docforge
Version	0.1.0
home_page	None
Summary	Forge perfect documents from any format with precision, power, and simplicity
upload_time	2025-07-13 22:29:47
maintainer	None
docs_url	None
author	None
requires_python	>=3.8
license	MIT
keywords	pdf ocr document processing optimization
requirements	PyMuPDF python-docx Pillow pytesseract pdf2image reportlab pdf2docx python-magic ocrmypdf rich

# DocForge 🔨

**Forge perfect documents from any format with precision, power, and simplicity.**

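DocForge is a document processing toolkit for PDFs. It covers OCR, file-size optimization, PDF-to-Word conversion, splitting, and batch processing, and it is available both as a Python API and as a `docforge` command-line tool. It builds on established libraries such as PyMuPDF, OCRmyPDF, and Tesseract (via pytesseract), organized in a modular architecture.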

## ✨ Features

- 🔍 **OCR Processing**: Convert scanned PDFs to searchable documents with precision
- 🗜️ **Smart Optimization**: Reduce file sizes without compromising quality  
- ⚙️ **Batch Processing**: Handle hundreds of documents efficiently
- 🔧 **Document Analysis**: Extract insights and metadata
- 🎯 **Modular Design**: Use only what you need, extend easily

## 🚀 Why DocForge?

- **Battle-tested OCR algorithms** with Windows compatibility
- **Advanced optimization techniques** from real-world usage
- **Memory-efficient batch processing** for large-scale operations
- **Clean, modular codebase** that's easy to understand and extend
- **Comprehensive error handling** and logging
- **Both programmatic API and command-line interface**

## 📦 Installation

### Option 1: Install from PyPI
```bash
pip install docforge
```

### Option 2: Install from source
```bash
git clone https://github.com/oscar2song/docforge.git
cd docforge
pip install -e .
```
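
To confirm that either install route worked, you can query the installed distribution metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and not part of DocForge.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("docforge")
    print(f"docforge version: {found}" if found else "docforge is not installed")
```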

### System Dependencies

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr poppler-utils
```

**macOS:**
```bash
brew install tesseract poppler
```

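**Windows:**
Download and install Tesseract from https://github.com/tesseract-ocr/tesseract, and install Poppler as well (the Linux and macOS commands above install both). Make sure both tools are on your `PATH`.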
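
Because these system dependencies are command-line tools, a quick check is to see whether they are reachable on `PATH`. The sketch below assumes the relevant executables are `tesseract` and `pdftoppm` (a utility shipped with Poppler); which binaries DocForge actually invokes is not stated here, and `missing_tools` is a hypothetical helper.

```python
import shutil


def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # `tesseract` is the Tesseract CLI; `pdftoppm` ships with Poppler.
    # Treating these two as the required binaries is an assumption.
    missing = missing_tools(["tesseract", "pdftoppm"])
    print("Missing: " + ", ".join(missing) if missing else "All tools found")
```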

## 🎯 Quick Start

### Command Line Interface

After installation, use the `docforge` command:

```bash
# Get help
docforge --help

# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf

# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/

# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng

# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/

# Test the interface
docforge test-rich

# Run performance benchmarks
docforge benchmark --test-files document.pdf
```
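
If you want to drive the CLI from a Python script, for example inside a scheduled job, the commands above can be wrapped with `subprocess`. This sketch uses only the `ocr` subcommand and the `-i`, `-o`, and `--language` flags shown above; the function names are illustrative, and the assumption that `docforge` reports failure through a non-zero exit code is not confirmed by this README.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="eng"):
    """Build the argv list for `docforge ocr` using the flags shown above."""
    return [
        "docforge", "ocr",
        "-i", str(input_pdf),
        "-o", str(output_pdf),
        "--language", language,
    ]


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Run the command and return True when it exits with status 0.

    Assumes a non-zero exit code means failure, which is an assumption.
    """
    completed = subprocess.run(build_ocr_command(input_pdf, output_pdf, language))
    return completed.returncode == 0
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.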

### Programmatic API

```python
from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language="eng"
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive"
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/"
)
```
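
The calls above can be chained so that a scanned file is first made searchable and then reduced in size. The following is a sketch under stated assumptions: it relies only on the `ocr_pdf` and `optimize_pdf` call shapes shown above, the helper `ocr_then_optimize` and its output file names are hypothetical, and because the structure of the returned `result` objects is not documented here they are passed back unchanged.

```python
from pathlib import Path


def ocr_then_optimize(processor, source_pdf, output_dir):
    """OCR `source_pdf`, then optimize the searchable copy.

    `processor` is expected to expose `ocr_pdf` and `optimize_pdf` with the
    call shapes shown in this README. Their return values are not documented
    here, so they are returned as-is.
    """
    source = Path(source_pdf)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme for the intermediate and final files.
    searchable = out / f"{source.stem}_searchable.pdf"
    optimized = out / f"{source.stem}_optimized.pdf"

    ocr_result = processor.ocr_pdf(str(source), str(searchable), language="eng")
    opt_result = processor.optimize_pdf(
        str(searchable), str(optimized), optimization_type="aggressive"
    )
    return optimized, ocr_result, opt_result
```

With DocForge installed, a call might look like `ocr_then_optimize(DocumentProcessor(verbose=True), "scanned_document.pdf", "out/")`.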

## 🏗️ Architecture

DocForge is built with a clean, modular architecture:

```
docforge/
├── core/           # Core processing engine
├── pdf/            # PDF operations (proven implementations)  
├── cli/            # Command-line interface
├── utils/          # Shared utilities
└── main.py         # CLI entry point
```

## 📋 Available Commands

| Command | Description |
|---------|-------------|
| `enhanced-ocr` | OCR with advanced performance optimization |
| `enhanced-batch-ocr` | Batch OCR with intelligent performance optimization |
| `ocr` | Standard OCR processing |
| `batch-ocr` | Standard batch OCR processing |
| `optimize` | PDF optimization |
| `pdf-to-word` | PDF to Word conversion |
| `split-pdf` | Split PDF documents |
| `benchmark` | Run performance benchmarks |
| `perf-stats` | Display performance statistics |
| `test-rich` | Test Rich CLI interface |
| `test-errors` | Test error handling |
| `test-validation` | Test the validation system |

## 🧪 Examples

Run the examples to see DocForge in action:

```bash
# Basic usage examples (if you have example files)
python examples/basic_usage.py

# Test the CLI interface
docforge test-rich

# Test error handling
docforge test-errors

# Test validation system  
docforge test-validation
```

## 🤝 Contributing

We welcome contributions! The modular architecture makes it easy to add new features.

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 🗺️ Roadmap

- ✅ Core PDF processing with proven implementations
- ✅ OCR and optimization capabilities  
- ✅ Command-line interface
- ✅ Comprehensive documentation
- 📄 Word document processing (Word ↔ PDF conversion)
- 🎨 Modern GUI interface
- 🚀 Performance optimizations
- 📊 Excel and PowerPoint support
- 🤖 AI-powered document analysis
- 🌐 Web interface

## 📄 License

This project is licensed under the MIT License.

## 🏆 Acknowledgments

Built for the open source community on top of open source libraries, including PyMuPDF, OCRmyPDF, pytesseract, pdf2image, pdf2docx, ReportLab, and Rich.

---

⭐ **If DocForge helped you, please give it a star!** ⭐

*Built by craftsmen, for craftsmen.* 🔨

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "docforge",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "Oscar Song <oscar2song@gmail.com>",
    "keywords": "pdf, ocr, document, processing, optimization",
    "author": null,
    "author_email": "Oscar Song <oscar2song@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a8/6e/42ae2199b8df49765285aa752664b05042967b7a1ccd68a5aeb79b1d8e62/docforge-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Forge perfect documents from any format with precision, power, and simplicity",
    "version": "0.1.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/oscar2song/docforge/issues",
        "Changelog": "https://github.com/oscar2song/docforge/blob/main/CHANGELOG.md",
        "Documentation": "https://oscar2song.github.io/docforge",
        "Homepage": "https://github.com/oscar2song/docforge",
        "Repository": "https://github.com/oscar2song/docforge.git"
    },
    "split_keywords": [
        "pdf",
        " ocr",
        " document",
        " processing",
        " optimization"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "729c161b79acbd08c7fc7b82268dba7a79bbcbde8a5404291b2acdcbd2080f32",
                "md5": "8d0e645d7b7b6898b23cc2723b5c5088",
                "sha256": "28194596e3dac1ae07affd3e443f40bee3281ebcf4d179a1d737bd394333f8e1"
            },
            "downloads": -1,
            "filename": "docforge-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8d0e645d7b7b6898b23cc2723b5c5088",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 83145,
            "upload_time": "2025-07-13T22:29:46",
            "upload_time_iso_8601": "2025-07-13T22:29:46.139191Z",
            "url": "https://files.pythonhosted.org/packages/72/9c/161b79acbd08c7fc7b82268dba7a79bbcbde8a5404291b2acdcbd2080f32/docforge-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a86e42ae2199b8df49765285aa752664b05042967b7a1ccd68a5aeb79b1d8e62",
                "md5": "c48d61a539b46a3d219b326d90a14528",
                "sha256": "36e6e2603953995b6c98ae23a1a4955ed4c231c5bbc19e5ae94a911175d2b40d"
            },
            "downloads": -1,
            "filename": "docforge-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "c48d61a539b46a3d219b326d90a14528",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 80318,
            "upload_time": "2025-07-13T22:29:47",
            "upload_time_iso_8601": "2025-07-13T22:29:47.197363Z",
            "url": "https://files.pythonhosted.org/packages/a8/6e/42ae2199b8df49765285aa752664b05042967b7a1ccd68a5aeb79b1d8e62/docforge-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-13 22:29:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "oscar2song",
    "github_project": "docforge",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "PyMuPDF",
            "specs": [
                [
                    ">=",
                    "1.23.0"
                ]
            ]
        },
        {
            "name": "python-docx",
            "specs": [
                [
                    ">=",
                    "0.8.11"
                ]
            ]
        },
        {
            "name": "Pillow",
            "specs": [
                [
                    ">=",
                    "8.0.0"
                ]
            ]
        },
        {
            "name": "pytesseract",
            "specs": [
                [
                    ">=",
                    "0.3.10"
                ]
            ]
        },
        {
            "name": "pdf2image",
            "specs": [
                [
                    ">=",
                    "1.16.0"
                ]
            ]
        },
        {
            "name": "reportlab",
            "specs": [
                [
                    ">=",
                    "3.6.0"
                ]
            ]
        },
        {
            "name": "pdf2docx",
            "specs": [
                [
                    ">=",
                    "0.5.6"
                ]
            ]
        },
        {
            "name": "python-magic",
            "specs": [
                [
                    ">=",
                    "0.4.27"
                ]
            ]
        },
        {
            "name": "ocrmypdf",
            "specs": [
                [
                    ">=",
                    "13.0.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    ">=",
                    "13.0.0"
                ]
            ]
        }
    ],
    "lcname": "docforge"
}
