indoxMiner

| Field | Value |
| --- | --- |
| Name | indoxMiner |
| Version | 0.0.7 |
| home_page | https://github.com/osllmai/inDox |
| Summary | Indox Data Extraction |
| upload_time | 2024-11-05 08:05:52 |
| maintainer | None |
| docs_url | None |
| author | nerdstudio |
| requires_python | >=3.9 |
| license | AGPL-3.0 |
| keywords | document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing |
| requirements | latex2markdown, loguru, numpy, pandas, protobuf, pydantic, PyPDF2, python-dotenv, Requests, setuptools, tenacity, tiktoken, tokenizers, umap_learn, unstructured, nltk, pillow_heif |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# IndoxMiner

[![PyPI version](https://badge.fury.io/py/indoxminer.svg)](https://badge.fury.io/py/indoxminer)
[![License: AGPL-3.0](https://img.shields.io/badge/License-AGPL-yellow.svg)](https://opensource.org/licenses/AGPL)

IndoxMiner is a powerful Python library that leverages Large Language Models (LLMs) to extract structured information from unstructured data sources including text, PDFs, and images. Using a flexible schema-based approach, it enables precise data extraction, validation, and transformation, making it ideal for automating document processing workflows.

## 🚀 Key Features

- **Multi-Format Support**: Extract data from text, PDFs, images, and scanned documents
- **Schema-Based Extraction**: Define custom schemas to specify exactly what data to extract
- **LLM Integration**: Seamless integration with OpenAI models for intelligent extraction
- **Validation & Type Safety**: Built-in validation rules and type-safe field definitions
- **Flexible Output**: Export to JSON, pandas DataFrames, or custom formats
- **Async Support**: Built for scalability with asynchronous processing capabilities
- **OCR Integration**: Multiple OCR engine options for image-based text extraction
- **High-Resolution Support**: Enhanced processing for high-quality PDFs
- **Error Handling**: Comprehensive error handling and validation reporting

## 📦 Installation

```bash
pip install indoxminer
```
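
To confirm that the install succeeded, the installed version can be read from the package metadata with the standard library. This is a minimal sketch; the helper name `installed_version` is illustrative and not part of IndoxMiner.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("indoxminer"))
```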

## 🎯 Quick Start

### Basic Text Extraction

```python
from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"  # any OpenAI chat model your account can access
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

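# Extract and convert to DataFrame.
# Top-level `await` works in a notebook; in a script, call this from a
# coroutine started with asyncio.run().
result = await extractor.extract(text)
df = extractor.to_dataframe(result)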
```
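
The `await` call above runs as written only where an event loop is already active, such as a Jupyter notebook. In a plain script it has to live inside a coroutine driven by `asyncio.run`. The sketch below shows that pattern with a `StubExtractor` class, which is a hypothetical stand-in (not part of IndoxMiner) so the example runs without an API key; a real `Extractor` would be passed in its place.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in for an object with an async `extract` method."""

    async def extract(self, text: str) -> dict:
        await asyncio.sleep(0)  # yield to the event loop, as a network call would
        return {"characters": len(text)}


async def run_extraction(extractor, text: str) -> dict:
    # `await` is only valid inside a coroutine such as this one.
    return await extractor.extract(text)


def main() -> dict:
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(run_extraction(StubExtractor(), "MacBook Pro 16-inch"))


if __name__ == "__main__":
    print(main())
```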

### PDF Processing

```python
from indoxminer import (
    DocumentProcessor,
    Extractor,
    ExtractorSchema,
    Field,
    FieldType,
    ProcessingConfig,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

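# Build an extractor for this schema (reuses llm_extractor from the
# Basic Text Extraction example), then run it on the processed documents
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)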
```

### Image Processing with OCR

```python
from indoxminer import DocumentProcessor, ProcessingConfig

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()
```

## 🔧 Core Components

### ExtractorSchema

Defines the structure of data to be extracted:

- Field definitions
- Validation rules
- Output format specifications

```python
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
```

### Field Types

Supported field types:

- `STRING`: Text data
- `INTEGER`: Whole numbers
- `FLOAT`: Decimal numbers
- `DATE`: Date values
- `BOOLEAN`: True/False values
- `LIST`: Arrays of values
- `DICT`: Nested objects

### Validation Rules

Available validation options:

- `min_length`/`max_length`: String length constraints
- `min_value`/`max_value`: Numeric bounds
- `pattern`: Regex patterns
- `required`: Required fields
- `custom`: Custom validation functions
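
To make the semantics of these options concrete, here is a minimal sketch of how such checks could be applied to a single value. `RuleSketch` and `check` are hypothetical names that mirror the option list above; they are not the library's `ValidationRule` class, whose actual behavior may differ.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class RuleSketch:
    """Hypothetical mirror of the options listed above; not the library's class."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def check(value: Any, rule: RuleSketch) -> List[str]:
    """Return a list of problems found; an empty list means the value is valid."""
    if value is None:
        return ["value is required"] if rule.required else []
    problems = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            problems.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            problems.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            problems.append(f"does not match {rule.pattern!r}")
    # bool is a subclass of int, so exclude it from the numeric checks.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            problems.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            problems.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        problems.append("failed custom check")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a field at once.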

## ⚙️ Configuration Options

### ProcessingConfig

```python
config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
```
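
The README does not say whether `chunk_size` counts characters or tokens, or how `max_threads` is used internally. As a rough mental model only, the sketch below assumes fixed-size character chunks without overlap and a thread pool that processes them in parallel. The function names are hypothetical and not part of IndoxMiner.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(
    chunks: List[str], worker: Callable[[str], int], max_threads: int
) -> List[int]:
    """Apply `worker` to every chunk on up to `max_threads` threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(worker, chunks))


# A 2,500-character text with chunk_size=1000 gives two full chunks and a remainder.
chunks = split_into_chunks("a" * 2500, chunk_size=1000)
lengths = process_chunks(chunks, len, max_threads=4)
```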

## 🔍 Error Handling

IndoxMiner provides detailed error reporting:

```python
results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
```
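
When many chunks fail validation, a per-field summary is often easier to act on than a per-chunk printout. The sketch below counts errors by field from a mapping shaped like `validation_errors` above. `ErrorSketch` is a hypothetical stand-in that exposes only the `field` and `message` attributes used in the example; the real error objects may carry more.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ErrorSketch:
    """Hypothetical stand-in exposing the `field` and `message` attributes used above."""
    field: str
    message: str


def errors_per_field(validation_errors: Dict[int, List[ErrorSketch]]) -> Counter:
    """Count how many validation errors each field produced across all chunks."""
    counts = Counter()
    for errors in validation_errors.values():
        for error in errors:
            counts[error.field] += 1
    return counts


sample = {
    0: [ErrorSketch("price", "value below minimum")],
    2: [ErrorSketch("price", "not a number"), ErrorSketch("product_name", "too short")],
}
summary = errors_per_field(sample)
```

Calling `summary.most_common()` then lists the fields that fail most often first, which points at the schema descriptions or rules most in need of adjustment.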

## 🤝 Contributing

We welcome contributions! To contribute:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

Please read our [Contributing Guidelines](CONTRIBUTING.md) for more details.

## 📄 License

This project is licensed under the AGPL-3.0 License. See the [LICENSE](LICENSE) file for details.

## 🆘 Support

- **Documentation**: [Full documentation](https://indoxminer.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/osllmai/inDox/issues)
- **Discussions**: [GitHub Discussions](https://github.com/osllmai/inDox/discussions)

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=osllmai/inDox&type=Date)](https://star-history.com/#osllmai/inDox&Date)

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/osllmai/inDox",
    "name": "indoxMiner",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "document-processing, information-extraction, llm, openai, pdf, text-processing, natural-language-processing",
    "author": "nerdstudio",
    "author_email": "ashkan@nematifamilyfundation.onmicrosoft.com",
    "download_url": "https://files.pythonhosted.org/packages/0f/b2/d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be/indoxminer-0.0.7.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "AGPL-3.0",
    "summary": "Indox Data Extraction",
    "version": "0.0.7",
    "project_urls": {
        "Homepage": "https://github.com/osllmai/inDox"
    },
    "split_keywords": [
        "document-processing",
        " information-extraction",
        " llm",
        " openai",
        " pdf",
        " text-processing",
        " natural-language-processing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fd1fcc828b09b0876dfb60f21c41ce2a2c6a6b21d55ceeb2f3ec860c046dde75",
                "md5": "19ce5e9eb0c4f879e0064a8b0c611ae9",
                "sha256": "996241f7085c9a40bb18c6373817063440761eef1c71f35c64a80dab2262bab3"
            },
            "downloads": -1,
            "filename": "indoxMiner-0.0.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "19ce5e9eb0c4f879e0064a8b0c611ae9",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 36959,
            "upload_time": "2024-11-05T08:05:50",
            "upload_time_iso_8601": "2024-11-05T08:05:50.796207Z",
            "url": "https://files.pythonhosted.org/packages/fd/1f/cc828b09b0876dfb60f21c41ce2a2c6a6b21d55ceeb2f3ec860c046dde75/indoxMiner-0.0.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0fb2d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be",
                "md5": "bfa08e7a911dcb478192d38b5e13339f",
                "sha256": "9536fc8e38abc8d0ee611e30f75ac1eaac786efe15ddc728d52e16850398fffe"
            },
            "downloads": -1,
            "filename": "indoxminer-0.0.7.tar.gz",
            "has_sig": false,
            "md5_digest": "bfa08e7a911dcb478192d38b5e13339f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 37339,
            "upload_time": "2024-11-05T08:05:52",
            "upload_time_iso_8601": "2024-11-05T08:05:52.543599Z",
            "url": "https://files.pythonhosted.org/packages/0f/b2/d2939a0e7e9d9b37e514b0da20c70df0b966fc31a3b35fa449ad2a5dc8be/indoxminer-0.0.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-05 08:05:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "osllmai",
    "github_project": "inDox",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "latex2markdown",
            "specs": [
                [
                    "==",
                    "0.2.1"
                ]
            ]
        },
        {
            "name": "loguru",
            "specs": [
                [
                    "==",
                    "0.7.2"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.0.0"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.0.3"
                ]
            ]
        },
        {
            "name": "protobuf",
            "specs": [
                [
                    "==",
                    "5.27.2"
                ]
            ]
        },
        {
            "name": "pydantic",
            "specs": [
                [
                    "==",
                    "2.8.2"
                ]
            ]
        },
        {
            "name": "PyPDF2",
            "specs": [
                [
                    "==",
                    "3.0.1"
                ]
            ]
        },
        {
            "name": "python-dotenv",
            "specs": [
                [
                    "==",
                    "1.0.1"
                ]
            ]
        },
        {
            "name": "Requests",
            "specs": [
                [
                    "==",
                    "2.32.3"
                ]
            ]
        },
        {
            "name": "setuptools",
            "specs": [
                [
                    "==",
                    "69.1.1"
                ]
            ]
        },
        {
            "name": "tenacity",
            "specs": [
                [
                    "==",
                    "8.2.3"
                ]
            ]
        },
        {
            "name": "tiktoken",
            "specs": [
                [
                    "==",
                    "0.6.0"
                ]
            ]
        },
        {
            "name": "tokenizers",
            "specs": [
                [
                    "==",
                    "0.15.2"
                ]
            ]
        },
        {
            "name": "umap_learn",
            "specs": [
                [
                    "==",
                    "0.5.6"
                ]
            ]
        },
        {
            "name": "unstructured",
            "specs": [
                [
                    "==",
                    "0.15.8"
                ]
            ]
        },
        {
            "name": "nltk",
            "specs": [
                [
                    "==",
                    "3.9.1"
                ]
            ]
        },
        {
            "name": "pillow_heif",
            "specs": [
                [
                    "==",
                    "0.18.0"
                ]
            ]
        }
    ],
    "lcname": "indoxminer"
}
        