nanonets-extractor


| Field | Value |
|-------|-------|
| Name | nanonets-extractor |
| Version | 0.1.4 |
| Home page | https://github.com/nanonets/document-extractor |
| Summary | A unified document extraction library supporting local CPU, GPU, and cloud processing |
| Upload time | 2025-07-23 11:17:54 |
| Author | Nanonets |
| Requires Python | >=3.8 |
| License | MIT |
| Keywords | document, extraction, ocr, pdf, ai, machine-learning |
| Requirements | No requirements were recorded. |

# Nanonets Document Extractor

A Python library for extracting data from any document using AI. Supports cloud API, local CPU, and GPU processing.

## Quick Start

### Installation

```bash
# For cloud processing only (recommended)
pip install nanonets-extractor

# For local CPU processing (the quotes stop shells such as zsh from expanding the brackets)
pip install "nanonets-extractor[cpu]"

# For local GPU processing
pip install "nanonets-extractor[gpu]"
```

### Get Your Free API Key
Get your free API key from [https://app.nanonets.com/#/keys](https://app.nanonets.com/#/keys)

## Usage

### Basic Example

```python
from nanonets_extractor import DocumentExtractor

# Initialize extractor
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key_here"
)

# Extract data from any document
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)

print(result)
```

## Initialization Parameters

### DocumentExtractor()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mode` | str | Yes | Processing mode: `"cloud"`, `"cpu"`, or `"gpu"` |
| `api_key` | str | Yes (cloud mode) | Your Nanonets API key; may be omitted if the `NANONETS_API_KEY` environment variable is set |
| `model_path` | str | No | Custom model path for local processing |
| `device` | str | No | GPU device (e.g., "cuda:0") for GPU mode |

### Processing Modes

#### 1. Cloud Mode (Recommended)
```python
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key"
)
```
- ✅ No setup required
- ✅ Fastest processing
- ✅ Most accurate
- ✅ Supports all document types

#### 2. CPU Mode
```python
extractor = DocumentExtractor(mode="cpu")
```
- ✅ Works offline
- ⚠️ Slower processing
- ⚠️ Requires local dependencies

#### 3. GPU Mode
```python
extractor = DocumentExtractor(
    mode="gpu",
    device="cuda:0"  # optional
)
```
- ✅ Faster than CPU
- ✅ Works offline
- ⚠️ Requires CUDA-capable GPU
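
One way to choose among the three modes at runtime is sketched below. The helper name `pick_mode` and the `nvidia-smi` check are assumptions made for illustration and are not part of the library; only the mode strings and the `NANONETS_API_KEY` variable come from this README.

```python
import os
import shutil
from typing import Mapping, Optional


def pick_mode(env: Optional[Mapping[str, str]] = None,
              has_gpu: Optional[bool] = None) -> str:
    """Return "cloud", "gpu", or "cpu" (hypothetical helper, not library API)."""
    env = os.environ if env is None else env
    # Prefer cloud mode whenever an API key is available.
    if env.get("NANONETS_API_KEY"):
        return "cloud"
    # Heuristic assumption: an nvidia-smi binary on PATH suggests a CUDA GPU.
    if has_gpu is None:
        has_gpu = shutil.which("nvidia-smi") is not None
    return "gpu" if has_gpu else "cpu"


print(pick_mode({"NANONETS_API_KEY": "your_api_key"}))  # cloud
print(pick_mode({}, has_gpu=False))  # cpu
```

The returned string can be passed as `mode=` to `DocumentExtractor`. The `nvidia-smi` check is only a rough signal; a local install may need a stricter test.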

## Extract Method

### extractor.extract()

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_path` | str | Yes | Path to your document |
| `output_type` | str | No | Output format (default: `"flat-json"`) |
| `specified_fields` | list | Only for `"specified-fields"` | Names of the fields to extract |
| `json_schema` | dict | Only for `"specified-json"` | Custom JSON schema for output |

### Output Types

| Type | Description | Example or requirement |
|------|-------------|---------|
| `"flat-json"` | Simple key-value pairs | `{"invoice_number": "123", "total": "100.00"}` |
| `"markdown"` | Formatted markdown text | `# Invoice\n**Total:** $100.00` |
| `"specified-fields"` | Only requested fields | Must provide `specified_fields` parameter |
| `"specified-json"` | Custom JSON structure | Must provide `json_schema` parameter |
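
The pairing between output types and their companion parameters can be checked before any call is made. The sketch below is a hypothetical helper (`build_extract_kwargs` is not part of the library) that encodes only the rules stated in the table above and returns keyword arguments for `extractor.extract()`.

```python
from typing import Any, Dict, List, Optional

# Output types listed in this README.
OUTPUT_TYPES = {"flat-json", "markdown", "specified-fields", "specified-json"}


def build_extract_kwargs(
    file_path: str,
    output_type: str = "flat-json",
    specified_fields: Optional[List[str]] = None,
    json_schema: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate arguments against the table above (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"unknown output_type: {output_type!r}")
    kwargs: Dict[str, Any] = {"file_path": file_path, "output_type": output_type}
    if output_type == "specified-fields":
        if not specified_fields:
            raise ValueError("specified-fields requires specified_fields")
        kwargs["specified_fields"] = list(specified_fields)
    elif output_type == "specified-json":
        if not json_schema:
            raise ValueError("specified-json requires json_schema")
        kwargs["json_schema"] = json_schema
    return kwargs


kwargs = build_extract_kwargs(
    "invoice.pdf", "specified-fields", specified_fields=["invoice_number", "total"]
)
print(kwargs)
# result = extractor.extract(**kwargs)  # with an extractor created as shown earlier
```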

## Supported Document Types

Works with **any document type**:
- 📄 **PDFs** - Invoices, contracts, reports
- 🖼️ **Images** - Screenshots, photos, scans  
- 📊 **Spreadsheets** - Excel, CSV files
- 📝 **Text Documents** - Word docs, text files
- 🆔 **ID Documents** - Passports, licenses, certificates
- 🧾 **Receipts** - Any receipt or bill
- 📋 **Forms** - Tax forms, applications, surveys

## Examples

### Extract Invoice Data
```python
from nanonets_extractor import DocumentExtractor

extractor = DocumentExtractor(mode="cloud", api_key="your_key")

result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)
# Returns: {"invoice_number": "INV-001", "total": "150.00", "date": "2024-01-15", ...}
```
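
The sample output above shows values as strings, so downstream code may want typed values. The following is a minimal sketch, assuming the field names and formats from that sample (`total` as a decimal string, `date` in ISO format); real documents may yield different keys or formats, and `normalize_invoice` is a hypothetical helper.

```python
from datetime import date
from decimal import Decimal
from typing import Any, Dict


def normalize_invoice(flat: Dict[str, str]) -> Dict[str, Any]:
    """Convert string fields to typed values (field names are assumptions)."""
    typed: Dict[str, Any] = dict(flat)
    if "total" in flat:
        # Decimal avoids binary floating-point rounding on money amounts.
        typed["total"] = Decimal(flat["total"])
    if "date" in flat:
        typed["date"] = date.fromisoformat(flat["date"])
    return typed


# Values copied from the sample comment above, not from a real extraction.
sample = {"invoice_number": "INV-001", "total": "150.00", "date": "2024-01-15"}
print(normalize_invoice(sample))
```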

### Extract Specific Fields
```python
# Reuses the `extractor` created in the invoice example above
result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-fields",
    specified_fields=["name", "email", "phone", "experience"]
)
# Returns: {"name": "John Doe", "email": "john@email.com", ...}
```

### Get Markdown Output
```python
result = extractor.extract(
    file_path="report.pdf",
    output_type="markdown"
)
# Returns formatted markdown text
```

### Custom JSON Schema
```python
schema = {
    "personal_info": {
        "name": "string",
        "email": "string"
    },
    "skills": ["string"]
}

result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-json",
    json_schema=schema
)
```
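
To confirm that a result covers every key requested in the schema, a small recursive check can help. This sketch assumes the returned dictionary mirrors the nesting of the schema, which this README does not state explicitly; `missing_paths` and the sample result are hypothetical.

```python
from typing import Any, List


def missing_paths(schema: Any, result: Any, prefix: str = "") -> List[str]:
    """List schema keys that are absent from result (hypothetical helper)."""
    missing: List[str] = []
    if isinstance(schema, dict):
        for key, sub_schema in schema.items():
            path = f"{prefix}.{key}" if prefix else key
            if not isinstance(result, dict) or key not in result:
                missing.append(path)
            else:
                # Descend into nested objects; lists and scalars are not inspected.
                missing.extend(missing_paths(sub_schema, result[key], path))
    return missing


schema = {"personal_info": {"name": "string", "email": "string"}, "skills": ["string"]}
# Made-up result for illustration; a real call may return a different shape.
sample = {"personal_info": {"name": "John Doe"}, "skills": ["Python"]}
print(missing_paths(schema, sample))  # ['personal_info.email']
```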

## Command Line Usage

```bash
# Extract to JSON
nanonets-extractor document.pdf --output-type flat-json

# Extract specific fields
nanonets-extractor invoice.pdf --output-type specified-fields --fields invoice_number,total,date

# Use cloud API
nanonets-extractor document.pdf --mode cloud --api-key your_key

# Save to file
nanonets-extractor document.pdf --output result.json
```
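
The same commands can be driven from Python with `subprocess`. The sketch below builds the argument list from the flags shown above only; `build_cli_args` and `run_cli` are hypothetical helpers, and combining several flags in one call is assumed to work.

```python
import subprocess
from typing import List, Optional, Sequence


def build_cli_args(
    document: str,
    output_type: Optional[str] = None,
    fields: Optional[Sequence[str]] = None,
    mode: Optional[str] = None,
    api_key: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list using only the flags shown above."""
    args = ["nanonets-extractor", document]
    if output_type:
        args += ["--output-type", output_type]
    if fields:
        args += ["--fields", ",".join(fields)]
    if mode:
        args += ["--mode", mode]
    if api_key:
        args += ["--api-key", api_key]
    if output:
        args += ["--output", output]
    return args


def run_cli(args: List[str]) -> None:
    """Invoke the CLI; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(args, check=True)


args = build_cli_args(
    "invoice.pdf",
    output_type="specified-fields",
    fields=["invoice_number", "total", "date"],
)
print(" ".join(args))
# run_cli(args)  # uncomment once nanonets-extractor is installed
```

Passing a list to `subprocess.run` avoids shell quoting issues with file names that contain spaces.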

## Error Handling

```python
from nanonets_extractor import DocumentExtractor
from nanonets_extractor.exceptions import ExtractionError, APIError

try:
    extractor = DocumentExtractor(mode="cloud", api_key="your_key")
    result = extractor.extract(file_path="document.pdf")
    print(result)
except APIError as e:
    print(f"API error: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
```
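
When processing several files, it can be useful to record failures per file instead of stopping at the first one. The following is a minimal sketch: `extract_many` is a hypothetical helper that accepts any callable, and `fake_extract` stands in for `extractor.extract` so the example runs without the library installed.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], Any], paths: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run extract on each path; return (results, errors) keyed by path."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # narrow to APIError/ExtractionError in real code
            errors[path] = f"{type(exc).__name__}: {exc}"
    return results, errors


# Stand-in for extractor.extract so the sketch runs without the library.
def fake_extract(path: str) -> dict:
    if not path.endswith(".pdf"):
        raise ValueError("unsupported file")
    return {"source": path}


results, errors = extract_many(fake_extract, ["a.pdf", "b.xyz"])
print(results)  # {'a.pdf': {'source': 'a.pdf'}}
print(errors)   # {'b.xyz': 'ValueError: unsupported file'}
```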

## Environment Variables

Set your API key as an environment variable:

```bash
export NANONETS_API_KEY="your_api_key_here"
```

Then create the extractor without passing `api_key`:
```python
extractor = DocumentExtractor(mode="cloud")  # Uses env variable
```
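
If a missing key should fail early with a clear message, the lookup can be made explicit. This is a sketch only: `resolve_api_key` is a hypothetical helper, and how the library itself behaves when the variable is unset is not documented here.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit key, else NANONETS_API_KEY; raise if neither is set."""
    env = os.environ if env is None else env
    key = (explicit or env.get("NANONETS_API_KEY", "")).strip()
    if not key:
        raise RuntimeError("Set NANONETS_API_KEY or pass api_key explicitly")
    return key


print(resolve_api_key(env={"NANONETS_API_KEY": "your_api_key_here"}))  # your_api_key_here
# extractor = DocumentExtractor(mode="cloud", api_key=resolve_api_key())
```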

## License

MIT License - see LICENSE file for details.

## Support

- 📧 Email: support@nanonets.com
- 🌐 Website: https://nanonets.com
- 📖 Documentation: https://nanonets.com/documentation 

            
