# IndoxMiner
[PyPI version](https://badge.fury.io/py/indoxminer)
[License: AGPL-3.0](https://opensource.org/licenses/AGPL)
IndoxMiner is a Python library that uses **Large Language Models (LLMs)** for **data extraction** and pretrained models for **object detection**. It extracts structured data from unstructured text, PDFs, and images, and runs object detection with models such as YOLO and DETR.
## 🚀 Key Features
- **Data Extraction**: Extract structured data from text, PDFs, and images using schema-based extraction and LLMs.
- **Object Detection**: Leverage pre-trained object detection models for high-accuracy real-time image recognition.
- **OCR Integration**: Extract text from scanned documents or images with integrated OCR engines.
- **Schema-Based Extraction**: Define custom schemas for data extraction with validation and type-safety.
- **Multi-Model Support**: Supports a wide range of object detection models including YOLO, DETR, and more.
- **Async Support**: Built for scalability with asynchronous processing capabilities.
- **Flexible Outputs**: Export results to JSON, pandas DataFrames, or custom formats.
---
## 📦 Installation
Install IndoxMiner with:
```bash
pip install indoxminer
```
---
## 📝 Quick Start
### 1. Data Extraction
IndoxMiner integrates seamlessly with OpenAI models for **schema-based extraction** from text, PDFs, and images. Here's how you can extract structured data from a document:
#### Basic Text Extraction
```python
import asyncio

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize the OpenAI-backed extractor
llm_extractor = OpenAi(api_key="your-api-key", model="gpt-4o-mini")

# Define the extraction schema
schema = ExtractorSchema(
    fields=[
        Field(name="product_name", field_type=FieldType.STRING, rules=ValidationRule(min_length=2)),
        Field(name="price", field_type=FieldType.FLOAT, rules=ValidationRule(min_value=0)),
    ]
)

# Create the extractor and process the text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

# extract() is awaited; in a notebook, use `result = await extractor.extract(text)` instead
result = asyncio.run(extractor.extract(text))
# Convert the extraction result to a pandas DataFrame
df = extractor.to_dataframe(result)
```
#### PDF Processing
```python
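import asyncio

from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    ExtractorSchema, Field, FieldType, Extractor,
)

# Load the PDF and split it into chunks
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(hi_res_pdf=True, chunk_size=1000),
)
documents = processor.process()

# Define the invoice schema
schema = ExtractorSchema(
    fields=[
        Field(name="bill_to", field_type=FieldType.STRING),
        Field(name="invoice_date", field_type=FieldType.DATE),
        Field(name="total_amount", field_type=FieldType.FLOAT),
    ]
)

# Bind the new schema to an extractor (llm_extractor comes from the previous example)
extractor = Extractor(llm=llm_extractor, schema=schema)
# In a notebook, use `results = await extractor.extract(documents)` instead
results = asyncio.run(extractor.extract(documents))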
```
#### Image Processing with OCR
```python
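from indoxminer import DocumentProcessor, ProcessingConfig

# Enable OCR so that text can be read from the image
config = ProcessingConfig(ocr_enabled=True, ocr_engine="easyocr", language="en")
processor = DocumentProcessor(files=["receipt.jpg"], config=config)
documents = processor.process()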
```
---
### 2. Object Detection
IndoxMiner provides powerful **object detection** capabilities with support for a variety of models, such as YOLO, Detectron2, and DETR. Here's how to use these models for real-time image recognition.
#### Supported Models for Object Detection
| Model | Supported ✅ |
|---------------|:------------:|
| **Detectron2** | ✅ |
| **DETR** | ✅ |
| **DETR-CLIP** | ✅ |
| **GroundingDINO** | ✅ |
| **Kosmos2** | ✅ |
| **OWL-ViT** | ✅ |
| **RT-DETR** | ✅ |
| **SAM2** | ✅ |
| **YOLOv5** | ✅ |
| **YOLOv6** | ✅ |
| **YOLOv7** | ✅ |
| **YOLOv8** | ✅ |
| **YOLOv10** | ✅ |
| **YOLOv11** | ✅ |
| **YOLOX** | ✅ |
| **YOLO-World** | ❌ |
---
#### Object Detection with YOLOv5
```python
import asyncio

from indoxminer.detection import YOLOv5

# Initialize YOLOv5 model
detector = YOLOv5()

# Detect objects in an image
# (in a notebook, use `outputs = await detector.detect_objects(image_path)` instead)
image_path = "dog-cat-under-sheet.jpg"
outputs = asyncio.run(detector.detect_objects(image_path))

# Visualize results
detector.visualize_results(outputs)
```
To switch to another supported model (for example Detectron2, DETR, or YOLOv8), instantiate the corresponding detector class instead:
```python
from indoxminer.detection import YOLOv8  # assumed to be exported alongside YOLOv5

detector = YOLOv8()  # For YOLOv8
```
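Because `detect_objects` is awaited in the example above, several images can in principle be processed concurrently. The following is a minimal sketch of that pattern using `asyncio.gather` with a semaphore to cap concurrency. `FakeDetector` and `detect_many` are hypothetical names introduced only for illustration and are not part of IndoxMiner; the stand-in lets the sketch run without model weights. Whether a real detector benefits from this depends on how its inference is implemented, which is not stated here.

```python
import asyncio


class FakeDetector:
    """Hypothetical stand-in for a detector; not part of IndoxMiner."""

    async def detect_objects(self, image_path: str) -> dict:
        # Simulate non-blocking inference latency.
        await asyncio.sleep(0.01)
        return {"image": image_path, "objects": []}


async def detect_many(detector, image_paths, max_concurrent: int = 2) -> list:
    """Run detection on several images, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def detect_one(path: str) -> dict:
        async with semaphore:
            return await detector.detect_objects(path)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(detect_one(p) for p in image_paths))


if __name__ == "__main__":
    paths = ["a.jpg", "b.jpg", "c.jpg"]
    results = asyncio.run(detect_many(FakeDetector(), paths))
    print([r["image"] for r in results])
```

The semaphore keeps memory use bounded when the image list is long; raise `max_concurrent` only if the underlying model can actually serve requests in parallel.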
---
## 🔧 Core Components
### ExtractorSchema
Defines the structure of data to be extracted:
- Field definitions
- Validation rules
- Output format specifications
```python
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
```
### Field Types
Supported field types:
- `STRING`: Text data
- `INTEGER`: Whole numbers
- `FLOAT`: Decimal numbers
- `DATE`: Date values
- `BOOLEAN`: True/False values
- `LIST`: Arrays of values
- `DICT`: Nested objects
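To make the type names above concrete, here is a minimal sketch of how raw extracted values (often strings) could be coerced into Python types. It is an illustration only: the `FieldType` enum below is a local mirror of the names in the list, and `coerce` is a hypothetical helper, not IndoxMiner's implementation. The accepted date format and boolean spellings are assumptions.

```python
from datetime import date, datetime
from enum import Enum


class FieldType(Enum):
    """Local mirror of the type names listed above (not the library's enum)."""
    STRING = "string"
    INTEGER = "integer"
    FLOAT = "float"
    DATE = "date"
    BOOLEAN = "boolean"
    LIST = "list"
    DICT = "dict"


def coerce(raw, field_type: FieldType):
    """Convert a raw extracted value to the Python type implied by field_type."""
    if field_type is FieldType.STRING:
        return str(raw).strip()
    if field_type is FieldType.INTEGER:
        return int(str(raw).replace(",", ""))
    if field_type is FieldType.FLOAT:
        # Drop a dollar sign and thousands separators, e.g. "$2,399.99".
        return float(str(raw).replace("$", "").replace(",", ""))
    if field_type is FieldType.DATE:
        # Assumes ISO 8601 dates (YYYY-MM-DD); other formats raise ValueError.
        if isinstance(raw, date):
            return raw
        return datetime.strptime(str(raw), "%Y-%m-%d").date()
    if field_type is FieldType.BOOLEAN:
        if isinstance(raw, bool):
            return raw
        return str(raw).strip().lower() in {"true", "yes", "1"}
    if field_type is FieldType.LIST:
        return list(raw)
    if field_type is FieldType.DICT:
        return dict(raw)
    raise ValueError(f"Unsupported field type: {field_type}")


print(coerce("$2,399.99", FieldType.FLOAT))   # 2399.99
print(coerce("Yes", FieldType.BOOLEAN))       # True
print(coerce("2024-12-29", FieldType.DATE))   # 2024-12-29
```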
### Validation Rules
Available validation options:
- `min_length`/`max_length`: String length constraints
- `min_value`/`max_value`: Numeric bounds
- `pattern`: Regex patterns
- `required`: Required fields
- `custom`: Custom validation functions
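The options above can be read as independent checks applied to one value. The sketch below shows one way such checks could be evaluated. `Rule` and `validate` are hypothetical names for illustration; only the option names come from the list above, and the exact semantics in IndoxMiner (for example, whether `pattern` must match the whole string) are an assumption here.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical stand-in whose attributes mirror the option names above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def validate(value: Any, rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value is valid."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        # Assumption: the pattern must match the entire string.
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"less than {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"greater than {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("custom check failed")
    return errors


print(validate("MacBook Pro", Rule(min_length=2)))  # []
print(validate(-1.0, Rule(min_value=0)))            # ['less than 0']
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a field at once.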
---
### Configuration Options
#### ProcessingConfig
Customize document processing behavior:
```python
config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
```
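As a rough illustration of what `chunk_size` and `max_threads` control, the sketch below splits text into fixed-size pieces and processes them on a thread pool. It is not IndoxMiner's implementation: `split_into_chunks` and `process_chunks` are hypothetical helpers, and treating `chunk_size` as a character count (rather than tokens or words) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(chunks: List[str], worker: Callable[[str], object],
                   max_threads: int = 4) -> list:
    """Apply worker to each chunk on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(worker, chunks))


chunks = split_into_chunks("x" * 2500, chunk_size=1000)
print([len(c) for c in chunks])     # [1000, 1000, 500]
print(process_chunks(chunks, len))  # [1000, 1000, 500]
```

Smaller chunks keep each LLM request short at the cost of more requests; the right value depends on the model's context window and the document layout.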
---
## 🔍 Error Handling
IndoxMiner provides detailed error reporting for both data extraction and object detection.
```python
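import asyncio

# In a notebook, use `results = await extractor.extract(documents)` instead
results = asyncio.run(extractor.extract(documents))

if not results.is_valid:
    # validation_errors maps each chunk index to the errors found in that chunk
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()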
```
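To try the error-handling loop without calling an LLM, a small stand-in object with the same attributes can be useful in tests. The sketch below is an assumption-laden illustration: `FieldError`, `FakeResults`, and `summarize_errors` are hypothetical names, and only the attribute names (`is_valid`, `validation_errors`, `field`, `message`, `get_valid_results`) are taken from the snippet above. The real result object may behave differently.

```python
from dataclasses import dataclass, field as dc_field
from typing import Any, Dict, List


@dataclass
class FieldError:
    """One validation problem: which field failed and why."""
    field: str
    message: str


@dataclass
class FakeResults:
    """Hypothetical stand-in exposing the attributes used in the snippet above."""
    data: List[Dict[str, Any]]
    validation_errors: Dict[int, List[FieldError]] = dc_field(default_factory=dict)

    @property
    def is_valid(self) -> bool:
        return not self.validation_errors

    def get_valid_results(self) -> List[Dict[str, Any]]:
        # Keep only the chunks that have no recorded errors.
        return [row for idx, row in enumerate(self.data)
                if idx not in self.validation_errors]


def summarize_errors(results) -> List[str]:
    """Flatten per-chunk errors into printable lines, ordered by chunk index."""
    return [
        f"chunk {idx}: {err.field}: {err.message}"
        for idx, errors in sorted(results.validation_errors.items())
        for err in errors
    ]


results = FakeResults(
    data=[{"price": 2399.99}, {"price": -5.0}],
    validation_errors={1: [FieldError("price", "must be >= 0")]},
)
print(results.is_valid)             # False
print(summarize_errors(results))    # ['chunk 1: price: must be >= 0']
print(results.get_valid_results())  # [{'price': 2399.99}]
```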
---
## 🤝 Contributing
We welcome contributions! To contribute:
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request
Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.
---
## 📄 License
This project is licensed under the AGPL-3.0 License; see the [LICENSE](LICENSE) file for details.
---
## 🆘 Support
- **Documentation**: [Full documentation](https://indoxminer.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/username/indoxminer/issues)
- **Discussions**: [GitHub Discussions](https://github.com/username/indoxminer/discussions)
---
## 🌟 Star History
[Star History Chart](https://star-history.com/#username/indoxminer&Date)
Raw data
{
"_id": null,
"home_page": "https://github.com/osllmai/inDox/libs/indoxMiner",
"name": "indoxMiner",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing",
"author": "nerdstudio",
"author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
"download_url": "https://files.pythonhosted.org/packages/12/bf/102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84/indoxminer-0.1.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "AGPL-3.0",
"summary": "Indox Data Extraction",
"version": "0.1.4",
"project_urls": {
"Homepage": "https://github.com/osllmai/inDox/libs/indoxMiner"
},
"split_keywords": [
"document-processing",
" information-extraction",
" llm",
" openai",
" pdf",
" text-processing",
" natural-language-processing"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8f22131f868d45e781190788910c88b546d3e056d9465e0d1cfefd1efff4b1f5",
"md5": "60422b3b81ac77a0a7ed8733c5a19f4a",
"sha256": "614f35816242ae3881de8085c7511114f7c8ea91fa6cc7fa02b52c99d1c4fb87"
},
"downloads": -1,
"filename": "indoxMiner-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "60422b3b81ac77a0a7ed8733c5a19f4a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 80424,
"upload_time": "2024-12-29T09:52:39",
"upload_time_iso_8601": "2024-12-29T09:52:39.103672Z",
"url": "https://files.pythonhosted.org/packages/8f/22/131f868d45e781190788910c88b546d3e056d9465e0d1cfefd1efff4b1f5/indoxMiner-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "12bf102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84",
"md5": "4ca885fc4565de273696d791132c639d",
"sha256": "8ade4f4513f9da1728b6c1a40dcb54846170bd46dfa8d2f27a3dc0884f9d856b"
},
"downloads": -1,
"filename": "indoxminer-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "4ca885fc4565de273696d791132c639d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 65640,
"upload_time": "2024-12-29T09:52:42",
"upload_time_iso_8601": "2024-12-29T09:52:42.021622Z",
"url": "https://files.pythonhosted.org/packages/12/bf/102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84/indoxminer-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-29 09:52:42",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "osllmai",
"github_project": "inDox",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "indoxminer"
}