indoxMiner


| Field | Value |
|-------|-------|
| Name | indoxMiner |
| Version | 0.1.4 |
| Home page | https://github.com/osllmai/inDox/libs/indoxMiner |
| Summary | Indox Data Extraction |
| Upload time | 2024-12-29 09:52:42 |
| Maintainer | None |
| Docs URL | None |
| Author | nerdstudio |
| Requires Python | >=3.9 |
| License | AGPL-3.0 |
| Keywords | document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing |
| Requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# IndoxMiner

[![PyPI version](https://badge.fury.io/py/indoxminer.svg)](https://badge.fury.io/py/indoxminer)  
[![License: AGPL-3.0](https://img.shields.io/badge/License-AGPL-yellow.svg)](https://opensource.org/licenses/AGPL-3.0)

IndoxMiner is a Python library that uses **Large Language Models (LLMs)** for **data extraction** and pretrained models for **object detection**. It turns unstructured text, PDFs, and images into structured, schema-validated data, and runs object detection on images with a choice of pretrained models.

## 🚀 Key Features

- **Data Extraction**: Extract structured data from text, PDFs, and images using schema-based extraction and LLMs.
- **Object Detection**: Leverage pre-trained object detection models for high-accuracy real-time image recognition.
- **OCR Integration**: Extract text from scanned documents or images with integrated OCR engines.
- **Schema-Based Extraction**: Define custom schemas for data extraction with validation and type safety.
- **Multi-Model Support**: Supports a wide range of object detection models including YOLO, DETR, and more.
- **Async Support**: Built for scalability with asynchronous processing capabilities.
- **Flexible Outputs**: Export results to JSON, pandas DataFrames, or custom formats.

---

## 📦 Installation

Install IndoxMiner with:

```bash
pip install indoxminer
```

---

## 📝 Quick Start

### 1. Data Extraction

IndoxMiner integrates seamlessly with OpenAI models for **schema-based extraction** from text, PDFs, and images. Here's how you can extract structured data from a document:

#### Basic Text Extraction

```python
from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(api_key="your-api-key", model="gpt-4o-mini")

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(name="product_name", field_type=FieldType.STRING, rules=ValidationRule(min_length=2)),
        Field(name="price", field_type=FieldType.FLOAT, rules=ValidationRule(min_value=0))
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

# Top-level `await` works in a notebook or async REPL; in a script,
# call this from an async function run with asyncio.run()
result = await extractor.extract(text)
df = extractor.to_dataframe(result)
```
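The snippet above uses a bare `await`, which a plain Python script does not allow at module level. The sketch below shows one way to drive an async call from a script with `asyncio.run`. `StubExtractor` is a hypothetical stand-in so the example runs without an API key; it is not part of IndoxMiner.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for an async extractor; not IndoxMiner's API."""

    async def extract(self, text: str) -> dict:
        await asyncio.sleep(0)  # yield control, as a real network call would
        first_line = text.strip().splitlines()[0]
        return {"product_name": first_line}


async def main() -> dict:
    extractor = StubExtractor()
    return await extractor.extract("MacBook Pro 16-inch with M2 chip\nPrice: $2,399.99")


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result)
```

In a real script, the body of `main()` would hold the `Extractor` calls from the example above.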

#### PDF Processing

```python
from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    ExtractorSchema, Field, FieldType, Extractor,
)

processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(hi_res_pdf=True, chunk_size=1000)
)

documents = processor.process()

# Define schema and extract structured data
schema = ExtractorSchema(
    fields=[
        Field(name="bill_to", field_type=FieldType.STRING),
        Field(name="invoice_date", field_type=FieldType.DATE),
        Field(name="total_amount", field_type=FieldType.FLOAT)
    ]
)

# Build an extractor for the invoice schema
# (reuses llm_extractor from the previous example)
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)
```
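When many documents need to be extracted, it can help to cap how many LLM calls run at once. This README does not say whether IndoxMiner already parallelises or rate-limits internally, so the following is only a generic `asyncio` sketch. `fake_extract` is a hypothetical placeholder for a single async extraction call.

```python
import asyncio


async def fake_extract(doc: str) -> dict:
    """Hypothetical stand-in for one async extraction call."""
    await asyncio.sleep(0.01)
    return {"source": doc, "length": len(doc)}


async def extract_all(docs: list[str], limit: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)  # at most `limit` calls in flight

    async def guarded(doc: str) -> dict:
        async with semaphore:
            return await fake_extract(doc)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(d) for d in docs))


results = asyncio.run(extract_all(["invoice one", "invoice two!", "third"]))
```

A semaphore keeps the code simple while still bounding concurrency, which matters when the backing API enforces rate limits.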

#### Image Processing with OCR

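```python
from indoxminer import DocumentProcessor, ProcessingConfig

# Enable OCR so text can be read from the image
config = ProcessingConfig(ocr_enabled=True, ocr_engine="easyocr", language="en")
processor = DocumentProcessor(files=["receipt.jpg"], config=config)

documents = processor.process()
```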

---

### 2. Object Detection

IndoxMiner provides **object detection** with support for a range of models, including YOLO, Detectron2, and DETR. The table below lists the supported models, followed by a usage example.

#### Supported Models for Object Detection

| Model             | Supported |
|-------------------|:---------:|
| **Detectron2**    | ✅        |
| **DETR**          | ✅        |
| **DETR-CLIP**     | ✅        |
| **GroundingDINO** | ✅        |
| **Kosmos2**       | ✅        |
| **OWL-ViT**       | ✅        |
| **RT-DETR**       | ✅        |
| **SAM2**          | ✅        |
| **YOLOv5**        | ✅        |
| **YOLOv6**        | ✅        |
| **YOLOv7**        | ✅        |
| **YOLOv8**        | ✅        |
| **YOLOv10**       | ✅        |
| **YOLOv11**       | ✅        |
| **YOLOX**         | ✅        |
| **YOLO-World**    | ❌        |


---

#### Object Detection with YOLOv5

```python
from indoxminer.detection import YOLOv5

# Initialize YOLOv5 model
detector = YOLOv5()

# Detect objects in an image
image_path = "dog-cat-under-sheet.jpg"
outputs = await detector.detect_objects(image_path)

# Visualize results
detector.visualize_results(outputs)
```

You can switch to another supported model, such as Detectron2, DETR, or YOLOv8, by instantiating its detector class instead:

```python
from indoxminer.detection import YOLOv8  # assumed to sit alongside YOLOv5

detector = YOLOv8()  # For YOLOv8
```
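If you would rather choose a detector from a string (for example, a value read from a config file), a small name-to-class registry is one way to do it. The classes below are hypothetical placeholders, not IndoxMiner classes; in practice the dictionary would map names to the real detector classes you have imported.

```python
class StubYOLOv5:
    """Hypothetical placeholder; a real detector class would go here."""
    name = "yolov5"


class StubYOLOv8:
    """Hypothetical placeholder; a real detector class would go here."""
    name = "yolov8"


DETECTORS = {"yolov5": StubYOLOv5, "yolov8": StubYOLOv8}


def make_detector(model_name: str):
    """Look up a detector class by case-insensitive name and instantiate it."""
    key = model_name.strip().lower()
    if key not in DETECTORS:
        known = ", ".join(sorted(DETECTORS))
        raise ValueError(f"unknown model {model_name!r}; choose one of: {known}")
    return DETECTORS[key]()


detector = make_detector("YOLOv8")
```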

---

## 🔧 Core Components

### ExtractorSchema

Defines the structure of data to be extracted:

- Field definitions
- Validation rules
- Output format specifications

```python
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
```

### Field Types

Supported field types:

- `STRING`: Text data
- `INTEGER`: Whole numbers
- `FLOAT`: Decimal numbers
- `DATE`: Date values
- `BOOLEAN`: True/False values
- `LIST`: Arrays of values
- `DICT`: Nested objects
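LLM output usually arrives as text, so typed fields imply some conversion step. How IndoxMiner performs that conversion is not described here; the sketch below only illustrates the idea in plain Python, using values from the Quick Start example. All helper names are hypothetical.

```python
from datetime import date, datetime


def to_float(raw: str) -> float:
    """Strip currency symbols and thousands separators, then parse."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return float(cleaned)


def to_bool(raw: str) -> bool:
    """Map common yes/no spellings to a boolean."""
    value = raw.strip().lower()
    if value in {"yes", "true", "1"}:
        return True
    if value in {"no", "false", "0"}:
        return False
    raise ValueError(f"cannot interpret {raw!r} as a boolean")


def to_date(raw: str) -> date:
    """Parse an ISO 8601 calendar date such as 2024-12-29."""
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date()


price = to_float("$2,399.99")
in_stock = to_bool("Yes")
invoice_date = to_date("2024-12-29")
```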

### Validation Rules

Available validation options:

- `min_length`/`max_length`: String length constraints
- `min_value`/`max_value`: Numeric bounds
- `pattern`: Regex patterns
- `required`: Required fields
- `custom`: Custom validation functions
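To make the semantics of these options concrete, here is a minimal, self-contained sketch of how such rules could be checked. It is not IndoxMiner's implementation: `Rule` and `check` are hypothetical names, and details such as whether `pattern` must match the whole string are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Rule:
    """Hypothetical rule container; field names mirror the options listed above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def check(value: Any, rule: Rule) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        # Assumption: the pattern must match the entire string
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern!r}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("custom check failed")
    return errors


name_errors = check("A", Rule(min_length=2))
price_errors = check(2399.99, Rule(min_value=0))
sku_errors = check("AB-123", Rule(pattern=r"[A-Z]{2}-\d{3}", required=True))
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem with a field at once.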

---

### Configuration Options

#### ProcessingConfig

Customize document processing behavior:

```python
config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
```
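The README does not state whether `chunk_size` counts characters or tokens, or whether chunks overlap. As a rough mental model only, the sketch below assumes fixed-size character chunks with no overlap; `chunk_text` is a hypothetical helper, not a library function.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 2500
chunks = chunk_text(sample, chunk_size=1000)
sizes = [len(c) for c in chunks]
print(sizes)
```

Smaller chunks keep each LLM request short, at the cost of more requests and a higher chance that a field's context is split across two chunks.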

---

## 🔍 Error Handling

IndoxMiner provides detailed error reporting for both data extraction and object detection. The example below covers extraction, where validation errors are grouped by chunk index:

```python
results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
```
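For logging or export, it can be convenient to flatten the per-chunk error mapping into one row per error. The sketch below assumes the shape implied by the loop above (a dict of chunk index to a list of objects with `field` and `message` attributes); `FieldError` is a hypothetical stand-in for whatever error type the library returns.

```python
from dataclasses import dataclass


@dataclass
class FieldError:
    """Hypothetical error record with the two attributes used above."""
    field: str
    message: str


def flatten_errors(validation_errors: dict[int, list[FieldError]]) -> list[dict]:
    """Turn {chunk_idx: [errors]} into one row per error, sorted by chunk."""
    rows = []
    for chunk_idx in sorted(validation_errors):
        for error in validation_errors[chunk_idx]:
            rows.append({"chunk": chunk_idx, "field": error.field, "message": error.message})
    return rows


sample_errors = {
    1: [FieldError("total_amount", "below minimum 0")],
    0: [FieldError("bill_to", "value is required")],
}
rows = flatten_errors(sample_errors)
```

A list of flat dictionaries like `rows` can be written to JSON or loaded into a pandas DataFrame directly.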

---

## 🤝 Contributing

We welcome contributions! To contribute:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.

---

## 📄 License

This project is licensed under the AGPL-3.0 License - see the [LICENSE](LICENSE) file for details.

---

## 🆘 Support

- **Documentation**: [Full documentation](https://indoxminer.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/osllmai/inDox/issues)
- **Discussions**: [GitHub Discussions](https://github.com/osllmai/inDox/discussions)

---

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=osllmai/inDox&type=Date)](https://star-history.com/#osllmai/inDox&Date)

Raw data

{
    "_id": null,
    "home_page": "https://github.com/osllmai/inDox/libs/indoxMiner",
    "name": "indoxMiner",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing",
    "author": "nerdstudio",
    "author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
    "download_url": "https://files.pythonhosted.org/packages/12/bf/102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84/indoxminer-0.1.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0",
    "summary": "Indox Data Extraction",
    "version": "0.1.4",
    "project_urls": {
        "Homepage": "https://github.com/osllmai/inDox/libs/indoxMiner"
    },
    "split_keywords": [
        "document-processing",
        " information-extraction",
        " llm",
        " openai",
        " pdf",
        " text-processing",
        " natural-language-processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8f22131f868d45e781190788910c88b546d3e056d9465e0d1cfefd1efff4b1f5",
                "md5": "60422b3b81ac77a0a7ed8733c5a19f4a",
                "sha256": "614f35816242ae3881de8085c7511114f7c8ea91fa6cc7fa02b52c99d1c4fb87"
            },
            "downloads": -1,
            "filename": "indoxMiner-0.1.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "60422b3b81ac77a0a7ed8733c5a19f4a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 80424,
            "upload_time": "2024-12-29T09:52:39",
            "upload_time_iso_8601": "2024-12-29T09:52:39.103672Z",
            "url": "https://files.pythonhosted.org/packages/8f/22/131f868d45e781190788910c88b546d3e056d9465e0d1cfefd1efff4b1f5/indoxMiner-0.1.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "12bf102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84",
                "md5": "4ca885fc4565de273696d791132c639d",
                "sha256": "8ade4f4513f9da1728b6c1a40dcb54846170bd46dfa8d2f27a3dc0884f9d856b"
            },
            "downloads": -1,
            "filename": "indoxminer-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "4ca885fc4565de273696d791132c639d",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 65640,
            "upload_time": "2024-12-29T09:52:42",
            "upload_time_iso_8601": "2024-12-29T09:52:42.021622Z",
            "url": "https://files.pythonhosted.org/packages/12/bf/102af2914aa3cfeba8b8406192aead57f377580f3ccc627087c77eb0bc84/indoxminer-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-29 09:52:42",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "osllmai",
    "github_project": "inDox",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "indoxminer"
}
        