# IndoxMiner
[![PyPI version](https://badge.fury.io/py/indoxminer.svg)](https://badge.fury.io/py/indoxminer)
[![License: AGPL](https://img.shields.io/badge/License-AGPL-yellow.svg)](https://opensource.org/licenses/AGPL)

IndoxMiner is a Python library that uses Large Language Models (LLMs) to extract structured information from unstructured sources such as text, PDFs, and images. You define a schema describing the fields you want, and the library extracts, validates, and transforms the data to match, which makes it suited to automating document-processing workflows.
## 🚀 Key Features
- **Multi-Format Support**: Extract data from text, PDFs, images, and scanned documents
- **Schema-Based Extraction**: Define custom schemas to specify exactly what data to extract
- **LLM Integration**: Seamless integration with OpenAI models for intelligent extraction
- **Validation & Type Safety**: Built-in validation rules and type-safe field definitions
- **Flexible Output**: Export to JSON, pandas DataFrames, or custom formats
- **Async Support**: Built for scalability with asynchronous processing capabilities
- **OCR Integration**: Multiple OCR engine options for image-based text extraction
- **High-Resolution Support**: Enhanced processing for high-quality PDFs
- **Error Handling**: Comprehensive error handling and validation reporting
## 📦 Installation
```bash
pip install indoxminer
```

IndoxMiner requires Python 3.9 or later.
## 🎯 Quick Start
### Basic Text Extraction
```python
from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

# Extract and convert to DataFrame
# (top-level `await` needs an async context, e.g. a Jupyter notebook or a coroutine)
result = await extractor.extract(text)
df = extractor.to_dataframe(result)
```
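
The `await` in the snippet above only works inside a coroutine or an async-aware REPL. The following is a minimal sketch of driving the same call from a plain script with `asyncio.run`. `StubExtractor`, `run_extraction`, and `extract_sync` are hypothetical names introduced for illustration and are not part of IndoxMiner; the stub simply returns a canned result so the pattern runs without an API key.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for Extractor; returns a canned result."""

    async def extract(self, text: str) -> dict:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return {"product_name": "MacBook Pro 16-inch", "price": 2399.99}


async def run_extraction(extractor, text: str) -> dict:
    # Any object with an async `extract` method works here.
    return await extractor.extract(text)


def extract_sync(extractor, text: str) -> dict:
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(run_extraction(extractor, text))


result = extract_sync(StubExtractor(), "MacBook Pro 16-inch with M2 chip\nPrice: $2,399.99")
print(result["price"])
```

In real use, the `Extractor` instance built in the Quick Start would presumably take the place of `StubExtractor()`.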
### PDF Processing
```python
from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    Extractor, ExtractorSchema, Field, FieldType,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Define the fields to extract from the invoice
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

# Build an extractor for this schema (llm_extractor comes from the previous example)
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)
```
### Image Processing with OCR
```python
from indoxminer import DocumentProcessor, ProcessingConfig

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()
```
## 🔧 Core Components
### ExtractorSchema
Defines the structure of data to be extracted:
- Field definitions
- Validation rules
- Output format specifications
```python
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
```
### Field Types
Supported field types:
- `STRING`: Text data
- `INTEGER`: Whole numbers
- `FLOAT`: Decimal numbers
- `DATE`: Date values
- `BOOLEAN`: True/False values
- `LIST`: Arrays of values
- `DICT`: Nested objects
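
To make the type names concrete, here is a rough sketch of how raw strings (as an LLM might return them) could be coerced into the corresponding Python types. The `coerce` function, the accepted date format (`YYYY-MM-DD`), and the boolean spellings are assumptions made for illustration; this is not IndoxMiner's actual conversion logic.

```python
import json
from datetime import datetime


def coerce(raw: str, field_type: str):
    """Convert a raw string into the Python type named by `field_type`."""
    text = raw.strip()
    if field_type == "STRING":
        return text
    if field_type == "INTEGER":
        return int(text.replace(",", ""))
    if field_type == "FLOAT":
        # Drop a leading dollar sign and thousands separators.
        return float(text.lstrip("$").replace(",", ""))
    if field_type == "DATE":
        # Assumes ISO-style dates; other formats raise ValueError.
        return datetime.strptime(text, "%Y-%m-%d").date()
    if field_type == "BOOLEAN":
        lowered = text.lower()
        if lowered in {"true", "yes"}:
            return True
        if lowered in {"false", "no"}:
            return False
        raise ValueError(f"not a boolean: {raw!r}")
    if field_type in {"LIST", "DICT"}:
        value = json.loads(text)
        expected = list if field_type == "LIST" else dict
        if not isinstance(value, expected):
            raise ValueError(f"expected {field_type}, got {type(value).__name__}")
        return value
    raise ValueError(f"unknown field type: {field_type}")


print(coerce("$2,399.99", "FLOAT"))
print(coerce("Yes", "BOOLEAN"))
print(coerce("2024-11-05", "DATE"))
```

Raising `ValueError` on anything unparseable keeps bad values from silently passing through as strings.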
### Validation Rules
Available validation options:
- `min_length`/`max_length`: String length constraints
- `min_value`/`max_value`: Numeric bounds
- `pattern`: Regex patterns
- `required`: Required fields
- `custom`: Custom validation functions
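
As an illustration of what these options mean in practice, the sketch below checks a single value against a set of rules using only the standard library. `Rule` and `check` are hypothetical names that mirror the option names listed above; the real `ValidationRule` class may behave differently (for example, in whether `pattern` must match the whole string, which this sketch assumes).

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical rule container mirroring the option names listed above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def check(value: Any, rule: Rule) -> List[str]:
    """Return a list of violations; an empty list means the value is valid."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and re.fullmatch(rule.pattern, value) is None:
            errors.append(f"does not match pattern {rule.pattern}")
    # bool is a subclass of int, so exclude it from the numeric checks.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"less than {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"greater than {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("custom check failed")
    return errors


print(check("A", Rule(min_length=2)))
print(check(-1.0, Rule(min_value=0)))
```

Returning every violation, instead of stopping at the first, lets a report list all problems with a field at once.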
## ⚙️ Configuration Options
### ProcessingConfig
```python
config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
```
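
The README does not say whether `chunk_size` counts characters or tokens, or how chunk boundaries are chosen. Purely to illustrate the idea of splitting a document into bounded pieces, here is a sketch that assumes a character limit and breaks only on whitespace; `chunk_text` is a hypothetical helper, not the library's chunker.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most
    `chunk_size` characters; a single longer word gets its own chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding the word would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("Invoice 1042 billed to Acme Corp on 2024-11-05", chunk_size=20)
print(chunks)
```

Smaller chunks keep each LLM request short, while larger ones keep more context together; the right value depends on the documents being processed.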
## 🔍 Error Handling
IndoxMiner provides detailed error reporting:
```python
results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
```
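
If you would rather log or tabulate the errors than print them, the nested mapping can be flattened into one row per error. The sketch below assumes `validation_errors` is a dict keyed by chunk index whose values are lists of objects with `field` and `message` attributes, as the loop above suggests. `FieldError` and `flatten_errors` are hypothetical names used so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FieldError:
    """Hypothetical error record exposing the two attributes used above."""
    field: str
    message: str


def flatten_errors(validation_errors: Dict[int, List[FieldError]]) -> List[dict]:
    """Turn {chunk_idx: [errors]} into one flat row per error, ordered by chunk."""
    rows = []
    for chunk_idx in sorted(validation_errors):
        for error in validation_errors[chunk_idx]:
            rows.append(
                {"chunk": chunk_idx, "field": error.field, "message": error.message}
            )
    return rows


sample = {
    1: [FieldError("price", "less than 0")],
    0: [FieldError("product_name", "shorter than 2 characters")],
}
rows = flatten_errors(sample)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pandas.DataFrame(rows)` for filtering or export.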
## 🤝 Contributing
We welcome contributions! To contribute:
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.
## 📄 License
This project is licensed under the AGPL-3.0 License. See the [LICENSE](LICENSE) file for details.
## 🆘 Support
- **Documentation**: [Full documentation](https://indoxminer.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/osllmai/inDox/issues)
- **Discussions**: [GitHub Discussions](https://github.com/osllmai/inDox/discussions)
## 🌟 Star History
[![Star History Chart](https://api.star-history.com/svg?repos=osllmai/inDox&type=Date)](https://star-history.com/#osllmai/inDox&Date)
Raw data
{
"_id": null,
"home_page": "https://github.com/osllmai/inDox",
"name": "indoxMiner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing",
"author": "nerdstudio",
"author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
"download_url": "https://files.pythonhosted.org/packages/0f/b2/d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be/indoxminer-0.0.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPL-3.0",
"summary": "Indox Data Extraction",
"version": "0.0.7",
"project_urls": {
"Homepage": "https://github.com/osllmai/inDox"
},
"split_keywords": [
"document-processing",
" information-extraction",
" llm",
" openai",
" pdf",
" text-processing",
" natural-language-processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "fd1fcc828b09b0876dfb60f21c41ce2a2c6a6b21d55ceeb2f3ec860c046dde75",
"md5": "19ce5e9eb0c4f879e0064a8b0c611ae9",
"sha256": "996241f7085c9a40bb18c6373817063440761eef1c71f35c64a80dab2262bab3"
},
"downloads": -1,
"filename": "indoxMiner-0.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "19ce5e9eb0c4f879e0064a8b0c611ae9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 36959,
"upload_time": "2024-11-05T08:05:50",
"upload_time_iso_8601": "2024-11-05T08:05:50.796207Z",
"url": "https://files.pythonhosted.org/packages/fd/1f/cc828b09b0876dfb60f21c41ce2a2c6a6b21d55ceeb2f3ec860c046dde75/indoxMiner-0.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0fb2d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be",
"md5": "bfa08e7a911dcb478192d38b5e13339f",
"sha256": "9536fc8e38abc8d0ee611e30f75ac1eaac786efe15ddc728d52e16850398fffe"
},
"downloads": -1,
"filename": "indoxminer-0.0.7.tar.gz",
"has_sig": false,
"md5_digest": "bfa08e7a911dcb478192d38b5e13339f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 37339,
"upload_time": "2024-11-05T08:05:52",
"upload_time_iso_8601": "2024-11-05T08:05:52.543599Z",
"url": "https://files.pythonhosted.org/packages/0f/b2/d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be/indoxminer-0.0.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-05 08:05:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "osllmai",
"github_project": "inDox",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "latex2markdown",
"specs": [
[
"==",
"0.2.1"
]
]
},
{
"name": "loguru",
"specs": [
[
"==",
"0.7.2"
]
]
},
{
"name": "numpy",
"specs": [
[
"==",
"2.0.0"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.0.3"
]
]
},
{
"name": "protobuf",
"specs": [
[
"==",
"5.27.2"
]
]
},
{
"name": "pydantic",
"specs": [
[
"==",
"2.8.2"
]
]
},
{
"name": "PyPDF2",
"specs": [
[
"==",
"3.0.1"
]
]
},
{
"name": "python-dotenv",
"specs": [
[
"==",
"1.0.1"
]
]
},
{
"name": "Requests",
"specs": [
[
"==",
"2.32.3"
]
]
},
{
"name": "setuptools",
"specs": [
[
"==",
"69.1.1"
]
]
},
{
"name": "tenacity",
"specs": [
[
"==",
"8.2.3"
]
]
},
{
"name": "tiktoken",
"specs": [
[
"==",
"0.6.0"
]
]
},
{
"name": "tokenizers",
"specs": [
[
"==",
"0.15.2"
]
]
},
{
"name": "umap_learn",
"specs": [
[
"==",
"0.5.6"
]
]
},
{
"name": "unstructured",
"specs": [
[
"==",
"0.15.8"
]
]
},
{
"name": "nltk",
"specs": [
[
"==",
"3.9.1"
]
]
},
{
"name": "pillow_heif",
"specs": [
[
"==",
"0.18.0"
]
]
}
],
"lcname": "indoxminer"
}