# Sparrow Parse
A powerful Python library for parsing and extracting structured information from documents using Vision Language Models (VLMs). Part of the [Sparrow](https://github.com/katanaml/sparrow) ecosystem for intelligent document processing.
## ✨ Features
- 🔍 **Document Data Extraction**: Extract structured data from invoices, forms, tables, and complex documents
- 🤖 **Multiple Backend Support**: MLX (Apple Silicon), Ollama, Docker, Hugging Face Cloud GPU, and local GPU inference
- 📄 **Multi-format Support**: Images (PNG, JPG, JPEG) and multi-page PDFs
- 🎯 **Schema Validation**: JSON schema-based extraction with automatic validation
- 📊 **Table Processing**: Specialized table detection and extraction capabilities
- 🖼️ **Image Annotation**: Bounding box annotations for extracted data
- 💬 **Text Instructions**: Support for instruction-based text processing
- ⚡ **Optimized Processing**: Image cropping, resizing, and preprocessing capabilities
## 🚀 Quick Start
### Installation
To run with MLX on Apple Silicon (macOS):
```bash
pip install "sparrow-parse[mlx]"
```
To run with Ollama on Linux/Windows:
```bash
pip install sparrow-parse
```
**Additional Requirements:**
- For PDF processing: `brew install poppler` (macOS) or `apt-get install poppler-utils` (Linux)
- For MLX backend: Apple Silicon Mac required
- For Hugging Face: Valid HF token with GPU access
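
Before processing PDFs, it can help to confirm that Poppler is reachable. The snippet below is a minimal sketch and not part of Sparrow Parse: it assumes that Poppler's `pdftoppm` and `pdfinfo` command-line tools are the ones that need to be on `PATH`, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Command-line tools shipped with Poppler (assumed to be what PDF processing needs).
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def missing_tools(tools=POPPLER_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"Poppler tools not found: {', '.join(missing)}")
    else:
        print("Poppler looks installed.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.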
### Basic Usage
```python
from sparrow_parse.vllm.inference_factory import InferenceFactory
from sparrow_parse.extractors.vllm_extractor import VLLMExtractor

# Initialize extractor
extractor = VLLMExtractor()

# Configure backend (MLX example)
config = {
    "method": "mlx",
    "model_name": "mlx-community/Mistral-Small-3.1-24B-Instruct-2503-8bit"
}

# Create inference instance
factory = InferenceFactory(config)
model_inference_instance = factory.get_inference_instance()

# Prepare input data
input_data = [{
    "file_path": "path/to/your/document.png",
    "text_input": "retrieve [{\"field_name\": \"str\", \"amount\": 0}]. return response in JSON format"
}]

# Run inference
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    debug=True
)

print(f"Extracted data: {results[0]}")
```
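
The `run_inference` signature in the API Reference section declares the results as a list of strings, so each entry usually needs decoding before use. The following is a minimal sketch, assuming each result is the JSON document requested in the prompt; `parse_result` is a hypothetical helper, not a library function, and the sample string stands in for real model output.

```python
import json


def parse_result(raw):
    """Decode one result string; return None if it is not valid JSON."""
    try:
        return json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None


# Stand-in for results[0]; real output depends on the document and the model.
sample = '[{"field_name": "Total", "amount": 120}]'

rows = parse_result(sample)
if rows is not None:
    for row in rows:
        print(row["field_name"], row["amount"])
```

Returning `None` instead of raising keeps a batch loop running when a single page yields malformed output; raising may be the better choice if every page must succeed.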
## 📖 Detailed Usage
### Backend Configuration
#### MLX Backend (Apple Silicon)
```python
config = {
    "method": "mlx",
    "model_name": "mlx-community/Qwen2.5-VL-72B-Instruct-4bit"
}
```
#### Ollama Backend
```python
config = {
    "method": "ollama",
    "model_name": "mistral-small3.2:24b-instruct-2506-q8_0"
}
```
#### Hugging Face Backend
```python
import os

config = {
    "method": "huggingface",
    "hf_space": "your-username/your-space",
    "hf_token": os.getenv("HF_TOKEN")
}
```
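
`os.getenv` returns `None` when `HF_TOKEN` is not set, which would only surface later as an authentication failure. A minimal sketch of failing early instead; `build_hf_config` is a hypothetical helper that produces the same dictionary shape as above, and the `env` parameter is there only to make the function easy to test.

```python
import os


def build_hf_config(hf_space, env=os.environ):
    """Build the Hugging Face backend config, failing early if no token is set."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running.")
    return {
        "method": "huggingface",
        "hf_space": hf_space,
        "hf_token": token,
    }
```

Typical use would be `config = build_hf_config("your-username/your-space")`, followed by `InferenceFactory(config)` as in the Basic Usage example.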
#### Local GPU Backend
```python
config = {
    "method": "local_gpu",
    "device": "cuda",
    "model_path": "path/to/model.pth"
}
```
### Input Data Formats
#### Document Processing
```python
input_data = [{
    "file_path": "invoice.pdf",
    "text_input": "extract invoice data: {\"invoice_number\": \"str\", \"total\": 0, \"date\": \"str\"}"
}]
```
#### Text-Only Processing
```python
input_data = [{
    "file_path": None,
    "text_input": "Summarize the key points about renewable energy."
}]
```
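
Hand-escaping quotes inside `text_input` is error-prone. One option is to keep the schema as a Python object and serialize it with `json.dumps`. The sketch below assumes the prompt wording used in the Basic Usage example; `build_input` is a hypothetical helper, not part of the library.

```python
import json


def build_input(file_path, schema):
    """Build the input_data list for one document from a schema template."""
    prompt = f"retrieve {json.dumps(schema)}. return response in JSON format"
    return [{"file_path": file_path, "text_input": prompt}]


input_data = build_input(
    "path/to/your/document.png",
    [{"field_name": "str", "amount": 0}],
)
print(input_data[0]["text_input"])
```

With the schema shown, the generated prompt is identical to the hand-written string in the Basic Usage example.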
### Advanced Options
#### Table Extraction Only
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    tables_only=True  # Extract only tables from document
)
```
#### Image Cropping
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    crop_size=60  # Crop 60 pixels from all borders
)
```
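
To make "crop 60 pixels from all borders" concrete, the sketch below computes the region that remains after such a crop. It illustrates the arithmetic only and is not the library's implementation; `crop_box` is a hypothetical name, and the returned tuple uses the `(left, upper, right, lower)` convention that Pillow's `Image.crop` accepts.

```python
def crop_box(width, height, crop_size):
    """Return the (left, upper, right, lower) box left after trimming each border."""
    if crop_size < 0:
        raise ValueError("crop_size must be non-negative")
    if 2 * crop_size >= min(width, height):
        raise ValueError("crop_size removes the whole image")
    return (crop_size, crop_size, width - crop_size, height - crop_size)


# A 1000x800 image cropped by 60 pixels keeps an 880x680 region.
print(crop_box(1000, 800, 60))
```

The second check matters for small images: a crop size that is fine for a full-page scan can leave nothing of a thumbnail.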
#### Bounding Box Annotations
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    apply_annotation=True  # Include bounding box coordinates
)
```
#### Generic Data Extraction
```python
results, num_pages = extractor.run_inference(
    model_inference_instance,
    input_data,
    generic_query=True  # Extract all available data
)
```
## 🛠️ Utility Functions
### PDF Processing
```python
from sparrow_parse.helpers.pdf_optimizer import PDFOptimizer

pdf_optimizer = PDFOptimizer()
num_pages, output_files, temp_dir = pdf_optimizer.split_pdf_to_pages(
    file_path="document.pdf",
    debug_dir="./debug",
    convert_to_images=True
)
```
### Image Optimization
```python
from sparrow_parse.helpers.image_optimizer import ImageOptimizer

image_optimizer = ImageOptimizer()
cropped_path = image_optimizer.crop_image_borders(
    file_path="image.jpg",
    temp_dir="./temp",
    debug_dir="./debug",
    crop_size=50
)
```
### Table Detection
```python
from sparrow_parse.processors.table_structure_processor import TableDetector

detector = TableDetector()
cropped_tables = detector.detect_tables(
    file_path="document.png",
    local=True,
    debug=True
)
```
## 🎯 Use Cases & Examples
### Invoice Processing
```python
import json

invoice_schema = {
    "invoice_number": "str",
    "date": "str",
    "vendor_name": "str",
    "total_amount": 0,
    "line_items": [{
        "description": "str",
        "quantity": 0,
        "price": 0.0
    }]
}

# Serialize the schema into the prompt instead of hand-escaping quotes
input_data = [{
    "file_path": "invoice.pdf",
    "text_input": f"extract invoice data: {json.dumps(invoice_schema)}"
}]
```
### Financial Tables
```python
import json

table_schema = [{
    "instrument_name": "str",
    "valuation": 0,
    "currency": "str or null"
}]

input_data = [{
    "file_path": "financial_report.png",
    "text_input": f"retrieve {json.dumps(table_schema)}. return response in JSON format"
}]
```
### Form Processing
```python
form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{
        "field_name": "str",
        "field_value": "str or null"
    }]
}
```
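
The schemas above are templates: `"str"` marks a string, `0` or `0.0` a number, `"str or null"` an optional string, and a one-element list a repeated record. A downstream check that parsed output follows such a template can be sketched as below. This is an illustrative checker written for this README under that reading of the templates, not the validation Sparrow Parse performs internally; `matches_template` is a hypothetical name.

```python
def matches_template(value, template):
    """Return True if `value` has the shape described by `template`."""
    if isinstance(template, dict):
        # Every template key must be present and match recursively.
        return isinstance(value, dict) and all(
            key in value and matches_template(value[key], sub)
            for key, sub in template.items()
        )
    if isinstance(template, list):
        if not isinstance(value, list):
            return False
        if not template:
            return True
        # A one-element list template describes every item of the list.
        return all(matches_template(item, template[0]) for item in value)
    if isinstance(template, str):
        if value is None:
            return template.endswith("or null")
        return isinstance(value, str)
    if isinstance(template, (int, float)):
        # bool is a subclass of int, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return False


form_schema = {
    "applicant_name": "str",
    "application_date": "str",
    "fields": [{"field_name": "str", "field_value": "str or null"}],
}
parsed = {
    "applicant_name": "Jane Doe",
    "application_date": "2024-01-15",
    "fields": [{"field_name": "phone", "field_value": None}],
}
print(matches_template(parsed, form_schema))
```

Extra keys in the output are tolerated here; a stricter variant could compare key sets instead.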
## ⚙️ Configuration Options
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `tables_only` | bool | False | Extract only tables from documents |
| `generic_query` | bool | False | Extract all available data without schema |
| `crop_size` | int | None | Pixels to crop from image borders |
| `apply_annotation` | bool | False | Include bounding box coordinates |
| `debug_dir` | str | None | Directory to save debug images |
| `debug` | bool | False | Enable debug logging |
| `mode` | str | None | Set to "static" for mock responses |
## 🔧 Troubleshooting
### Common Issues
**Import Errors:**
```bash
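# To skip the MLX dependency on machines without Apple Silicon.
# pip has no --exclude option, so filter the requirements file first
# (requirements-no-mlx.txt is just an example file name).
pip install sparrow-parse --no-deps
grep -v "mlx-vlm" requirements.txt > requirements-no-mlx.txt
pip install -r requirements-no-mlx.txt

# For missing poppler
brew install poppler  # macOS
sudo apt-get install poppler-utils  # Ubuntu/Debian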
```
**Memory Issues:**
- Use smaller models or reduce image resolution
- Enable image cropping to reduce processing load
- Process single pages instead of entire PDFs
**Model Loading Errors:**
- Verify model name and availability
- Check HF token permissions for private models
- Ensure sufficient disk space for model downloads
### Performance Tips
- **Image Size**: Resize large images before processing
- **Batch Processing**: Process multiple pages together when possible
- **Model Selection**: Choose appropriate model size for your hardware
- **Caching**: Models are cached after first load
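
For the image size tip, the target dimensions can be computed up front so the aspect ratio is preserved. A minimal sketch of that arithmetic; `fit_within` and the `max_side` limit are hypothetical, and a suitable limit depends on the model and hardware in use.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return (width, height)  # already small enough, leave untouched
    scale = max_side / longest
    return (max(1, round(width * scale)), max(1, round(height * scale)))


# A 4000x3000 scan limited to 2000 pixels on its longest side.
print(fit_within(4000, 3000, 2000))
```

The resulting pair can be passed to any image library's resize call before the file is handed to `run_inference`.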
## 📚 API Reference
### VLLMExtractor Class
```python
from typing import Dict, List, Optional, Tuple


class VLLMExtractor:
    def run_inference(
        self,
        model_inference_instance,
        input_data: List[Dict],
        tables_only: bool = False,
        generic_query: bool = False,
        crop_size: Optional[int] = None,
        apply_annotation: bool = False,
        debug_dir: Optional[str] = None,
        debug: bool = False,
        mode: Optional[str] = None
    ) -> Tuple[List[str], int]: ...
```
### InferenceFactory Class
```python
from typing import Dict


class InferenceFactory:
    def __init__(self, config: Dict): ...
    def get_inference_instance(self) -> "ModelInference": ...
```
## 🏗️ Development
### Building from Source
```bash
# Clone repository
git clone https://github.com/katanaml/sparrow.git
cd sparrow/sparrow-data/parse
# Create virtual environment
python -m venv .env_sparrow_parse
source .env_sparrow_parse/bin/activate # Linux/Mac
# or
.env_sparrow_parse\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Build package
pip install setuptools wheel
python setup.py sdist bdist_wheel
# Install locally
pip install -e .
```
### Running Tests
```bash
python -m pytest tests/
```
## 📄 Supported File Formats
| Format | Extension | Multi-page | Notes |
|--------|-----------|------------|-------|
| PNG | .png | ❌ | Recommended for tables/forms |
| JPEG | .jpg, .jpeg | ❌ | Good for photos/scanned docs |
| PDF | .pdf | ✅ | Automatically split into pages |
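
When input files come from users, a pre-check against the table above avoids handing an unsupported file to the extractor. The sketch below mirrors the table; `classify_document` and the `SUPPORTED_FORMATS` mapping are hypothetical and not part of the library.

```python
from pathlib import Path

# Extension -> (format name, multi-page), mirroring the table above.
SUPPORTED_FORMATS = {
    ".png": ("PNG", False),
    ".jpg": ("JPEG", False),
    ".jpeg": ("JPEG", False),
    ".pdf": ("PDF", True),
}


def classify_document(file_path):
    """Return (format name, multi_page) or raise ValueError if unsupported."""
    extension = Path(file_path).suffix.lower()
    if extension not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file type: {extension or 'no extension'}")
    return SUPPORTED_FORMATS[extension]


print(classify_document("Invoice.PDF"))
```

The lookup is case-insensitive, so `scan.JPG` and `scan.jpg` are treated the same way.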
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guidelines](https://github.com/katanaml/sparrow/blob/main/CONTRIBUTING.md) for details.
## 📞 Support
- 📖 [Documentation](https://github.com/katanaml/sparrow)
- 🐛 [Issue Tracker](https://github.com/katanaml/sparrow/issues)
- 💼 [Professional Services](mailto:abaranovskis@redsamuraiconsulting.com)
## 📜 License
Licensed under the GPL 3.0. Copyright 2020-2025 Katana ML, Andrej Baranovskij.
**Commercial Licensing:** Free for organizations with revenue under $5M USD annually. [Contact us](mailto:abaranovskis@redsamuraiconsulting.com) for commercial licensing options.
## 👥 Authors
- **[Katana ML](https://katanaml.io)**
- **[Andrej Baranovskij](https://github.com/abaranovskis-redsamurai)**
---
⭐ **Star us on [GitHub](https://github.com/katanaml/sparrow)** if you find Sparrow Parse useful!