# Nanonets Document Extractor
A Python library for extracting data from any document using AI. Supports cloud API, local CPU, and GPU processing.
## Quick Start
### Installation
```bash
# For cloud processing only (recommended)
pip install nanonets-extractor
# For local CPU processing (quotes keep shells such as zsh from expanding the brackets)
pip install "nanonets-extractor[cpu]"
# For local GPU processing
pip install "nanonets-extractor[gpu]"
```
### Get Your Free API Key
Get your free API key from [https://app.nanonets.com/#/keys](https://app.nanonets.com/#/keys)
## Usage
### Basic Example
```python
from nanonets_extractor import DocumentExtractor
# Initialize extractor
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key_here"
)
# Extract data from any document
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)
print(result)
```
## Initialization Parameters
### DocumentExtractor()
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `mode` | str | Yes | Processing mode: `"cloud"`, `"cpu"`, or `"gpu"` |
| `api_key` | str | Cloud mode only | Your Nanonets API key; can be omitted if the `NANONETS_API_KEY` environment variable is set |
| `model_path` | str | No | Custom model path for local processing |
| `device` | str | No | GPU device (e.g., `"cuda:0"`) for GPU mode |
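
The rules in the table can be checked before the extractor is constructed. The following is a minimal sketch, assuming a hypothetical helper named `build_init_kwargs` that is not part of the library. It only assembles keyword arguments and does not import `nanonets_extractor`; forwarding `device` only in GPU mode is an assumption drawn from the table's description.

```python
VALID_MODES = ("cloud", "cpu", "gpu")


def build_init_kwargs(mode, api_key=None, model_path=None, device=None):
    """Hypothetical helper: assemble keyword arguments for DocumentExtractor()."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    kwargs = {"mode": mode}
    if mode == "cloud":
        # api_key may be left out when NANONETS_API_KEY is set in the environment
        if api_key is not None:
            kwargs["api_key"] = api_key
    else:
        if model_path is not None:
            kwargs["model_path"] = model_path
        # Assumption: device is only meaningful in GPU mode
        if mode == "gpu" and device is not None:
            kwargs["device"] = device
    return kwargs


print(build_init_kwargs("gpu", device="cuda:0"))
```

The result could then be unpacked into the constructor, for example `DocumentExtractor(**build_init_kwargs("cloud", api_key="your_api_key"))`.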
### Processing Modes
#### 1. Cloud Mode (Recommended)
```python
extractor = DocumentExtractor(
    mode="cloud",
    api_key="your_api_key"
)
```
- ✅ No setup required
- ✅ Fastest processing
- ✅ Most accurate
- ✅ Supports all document types
#### 2. CPU Mode
```python
extractor = DocumentExtractor(mode="cpu")
```
- ✅ Works offline
- ⚠️ Slower processing
- ⚠️ Requires local dependencies
#### 3. GPU Mode
```python
extractor = DocumentExtractor(
    mode="gpu",
    device="cuda:0"  # optional
)
```
- ✅ Faster than CPU
- ✅ Works offline
- ⚠️ Requires CUDA-capable GPU
## Extract Method
### extractor.extract()
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_path` | str | Yes | Path to your document |
| `output_type` | str | No | Output format (default: `"flat-json"`) |
| `specified_fields` | list | No | Extract only specific fields |
| `json_schema` | dict | No | Custom JSON schema for output |
### Output Types
| Type | Description | Example or requirement |
|------|-------------|------------------------|
| `"flat-json"` | Simple key-value pairs | `{"invoice_number": "123", "total": "100.00"}` |
| `"markdown"` | Formatted markdown text | `# Invoice\n**Total:** $100.00` |
| `"specified-fields"` | Only requested fields | Requires the `specified_fields` parameter |
| `"specified-json"` | Custom JSON structure | Requires the `json_schema` parameter |
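
The pairing between output types and their required parameters can be enforced before calling `extract()`. This is a sketch under stated assumptions: `build_extract_kwargs` is a hypothetical helper, not a library function, and it treats the four output types listed above as the only valid ones.

```python
def build_extract_kwargs(file_path, output_type="flat-json",
                         specified_fields=None, json_schema=None):
    """Hypothetical helper: validate and assemble arguments for extract()."""
    # Output types that need a companion parameter, per the table above
    required = {
        "specified-fields": ("specified_fields", specified_fields),
        "specified-json": ("json_schema", json_schema),
    }
    known = {"flat-json", "markdown", *required}
    if output_type not in known:
        raise ValueError(f"unknown output_type: {output_type!r}")
    kwargs = {"file_path": file_path, "output_type": output_type}
    if output_type in required:
        name, value = required[output_type]
        if not value:
            raise ValueError(f"output_type {output_type!r} requires {name}")
        kwargs[name] = value
    return kwargs


print(build_extract_kwargs("resume.pdf", "specified-fields",
                           specified_fields=["name", "email"]))
```

The returned dictionary could be passed on as `extractor.extract(**kwargs)`.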
## Supported Document Types
Works with **a wide range of document types**, including:
- 📄 **PDFs** - Invoices, contracts, reports
- 🖼️ **Images** - Screenshots, photos, scans
- 📊 **Spreadsheets** - Excel, CSV files
- 📝 **Text Documents** - Word docs, text files
- 🆔 **ID Documents** - Passports, licenses, certificates
- 🧾 **Receipts** - Any receipt or bill
- 📋 **Forms** - Tax forms, applications, surveys
## Examples
### Extract Invoice Data
```python
extractor = DocumentExtractor(mode="cloud", api_key="your_key")
result = extractor.extract(
    file_path="invoice.pdf",
    output_type="flat-json"
)
# Returns: {"invoice_number": "INV-001", "total": "150.00", "date": "2024-01-15", ...}
```
### Extract Specific Fields
```python
result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-fields",
    specified_fields=["name", "email", "phone", "experience"]
)
# Returns: {"name": "John Doe", "email": "john@email.com", ...}
```
### Get Markdown Output
```python
result = extractor.extract(
    file_path="report.pdf",
    output_type="markdown"
)
# Returns formatted markdown text
```
### Custom JSON Schema
```python
schema = {
    "personal_info": {
        "name": "string",
        "email": "string"
    },
    "skills": ["string"]
}
result = extractor.extract(
    file_path="resume.pdf",
    output_type="specified-json",
    json_schema=schema
)
```
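
It can be useful to check that a returned result has the shape the schema asked for. The sketch below assumes that `"string"` is a type placeholder and that a one-element list means "a list of items with that shape"; both readings are inferred from the example, not confirmed by the library. `matches_schema` is a hypothetical helper.

```python
def matches_schema(value, schema):
    """Hypothetical check: does `value` have the shape described by `schema`?"""
    if isinstance(schema, dict):
        # Every key named in the schema must be present with a matching shape
        return (isinstance(value, dict)
                and all(key in value and matches_schema(value[key], sub)
                        for key, sub in schema.items()))
    if isinstance(schema, list):
        # Assumption: a one-element list describes the shape of each item
        return (len(schema) == 1
                and isinstance(value, list)
                and all(matches_schema(item, schema[0]) for item in value))
    if schema == "string":
        return isinstance(value, str)
    return False  # placeholders other than "string" are not covered here


schema = {"personal_info": {"name": "string", "email": "string"},
          "skills": ["string"]}
sample = {"personal_info": {"name": "John Doe", "email": "john@email.com"},
          "skills": ["Python", "SQL"]}
print(matches_schema(sample, schema))
```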
## Command Line Usage
```bash
# Extract to JSON
nanonets-extractor document.pdf --output-type flat-json
# Extract specific fields
nanonets-extractor invoice.pdf --output-type specified-fields --fields invoice_number,total,date
# Use cloud API
nanonets-extractor document.pdf --mode cloud --api-key your_key
# Save to file
nanonets-extractor document.pdf --output result.json
```
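
When the command line tool is driven from a script, building the argument list programmatically avoids quoting mistakes. This is a minimal sketch: `build_cli_command` is a hypothetical helper, it uses only the flags shown above, and it assumes those flags can be combined in a single invocation.

```python
def build_cli_command(path, output_type=None, fields=None,
                      mode=None, api_key=None, output=None):
    """Hypothetical helper: assemble an argument list for nanonets-extractor."""
    cmd = ["nanonets-extractor", path]
    if mode:
        cmd += ["--mode", mode]
    if api_key:
        cmd += ["--api-key", api_key]
    if output_type:
        cmd += ["--output-type", output_type]
    if fields:
        # The CLI example passes fields as one comma-separated value
        cmd += ["--fields", ",".join(fields)]
    if output:
        cmd += ["--output", output]
    return cmd


print(build_cli_command("invoice.pdf", output_type="specified-fields",
                        fields=["invoice_number", "total", "date"]))
```

The list can be handed to `subprocess.run(cmd, check=True)` from the standard library, which raises `CalledProcessError` if the command exits with a non-zero status.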
## Error Handling
```python
from nanonets_extractor import DocumentExtractor
from nanonets_extractor.exceptions import ExtractionError, APIError
try:
    extractor = DocumentExtractor(mode="cloud", api_key="your_key")
    result = extractor.extract("document.pdf")
    print(result)
except APIError as e:
    print(f"API error: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
```
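
The same pattern extends to several files, so that one failure does not stop the rest. The sketch below assumes a hypothetical helper `extract_all` that accepts any object with an `extract(file_path=...)` method. The exception classes to catch are passed in, so the helper itself does not import the library.

```python
def extract_all(extractor, paths, errors=(Exception,)):
    """Hypothetical helper: run extract() per file and collect failures."""
    results, failures = {}, {}
    for path in paths:
        try:
            results[path] = extractor.extract(file_path=path)
        except errors as exc:
            # Record the message and continue with the next file
            failures[path] = str(exc)
    return results, failures
```

With the library installed, a call might look like `extract_all(extractor, ["a.pdf", "b.pdf"], errors=(APIError, ExtractionError))`, using the exception classes imported above.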
## Environment Variables
Set your API key as an environment variable:
```bash
export NANONETS_API_KEY="your_api_key_here"
```
Then use without specifying the key:
```python
extractor = DocumentExtractor(mode="cloud") # Uses env variable
```
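
A script can fail early with a clear message when no key is available. This is a sketch assuming a hypothetical helper `resolve_api_key`; letting an explicit key take precedence over the environment variable is a design choice of this helper, not documented library behavior.

```python
import os


def resolve_api_key(explicit_key=None, environ=None):
    """Hypothetical helper: pick an API key or raise a descriptive error."""
    environ = os.environ if environ is None else environ
    # An explicit key wins; otherwise fall back to NANONETS_API_KEY
    key = explicit_key or environ.get("NANONETS_API_KEY")
    if not key:
        raise RuntimeError("No API key: pass api_key or set NANONETS_API_KEY")
    return key
```

The `environ` parameter exists only so the lookup can be exercised with a plain dictionary in tests.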
## License
MIT License - see LICENSE file for details.
## Support
- 📧 Email: support@nanonets.com
- 🌐 Website: https://nanonets.com
- 📖 Documentation: https://nanonets.com/documentation
Raw data
{
"_id": null,
"home_page": "https://github.com/nanonets/document-extractor",
"name": "nanonets-extractor",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "Nanonets <support@nanonets.com>",
"keywords": "document, extraction, ocr, pdf, ai, machine-learning",
"author": "Nanonets",
"author_email": "Nanonets <support@nanonets.com>",
"download_url": "https://files.pythonhosted.org/packages/e8/ba/9abfed7374b856ae8a45aa7e7474299b5e239499698e7300818e8fa17454/nanonets_extractor-0.1.4.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A unified document extraction library supporting local CPU, GPU, and cloud processing",
"version": "0.1.4",
"project_urls": {
"API Keys": "https://app.nanonets.com/#/keys",
"Bug Tracker": "https://github.com/nanonets/document-extractor/issues",
"Documentation": "https://docs.nanonets.com",
"Homepage": "https://github.com/nanonets/document-extractor",
"Repository": "https://github.com/nanonets/document-extractor"
},
"split_keywords": [
"document",
" extraction",
" ocr",
" pdf",
" ai",
" machine-learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "078f81a60cf47e22af1f99e4da9c9fc3a251b91ce529281d0806469b53f3d37f",
"md5": "71b25fff6a67d5de2fc0b6068d4492ff",
"sha256": "aa120562351a49947cdaddf0e30248ea3710c692440ec37727586d4017f28185"
},
"downloads": -1,
"filename": "nanonets_extractor-0.1.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "71b25fff6a67d5de2fc0b6068d4492ff",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 21811,
"upload_time": "2025-07-23T11:17:52",
"upload_time_iso_8601": "2025-07-23T11:17:52.915956Z",
"url": "https://files.pythonhosted.org/packages/07/8f/81a60cf47e22af1f99e4da9c9fc3a251b91ce529281d0806469b53f3d37f/nanonets_extractor-0.1.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e8ba9abfed7374b856ae8a45aa7e7474299b5e239499698e7300818e8fa17454",
"md5": "01a3a8997c8b292f3e4cce898a3ac87e",
"sha256": "efe9581a6cb2b1f9c2f610ac836a3ccb692cc98904ddbc898c6c03423a77bfef"
},
"downloads": -1,
"filename": "nanonets_extractor-0.1.4.tar.gz",
"has_sig": false,
"md5_digest": "01a3a8997c8b292f3e4cce898a3ac87e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 31411,
"upload_time": "2025-07-23T11:17:54",
"upload_time_iso_8601": "2025-07-23T11:17:54.574131Z",
"url": "https://files.pythonhosted.org/packages/e8/ba/9abfed7374b856ae8a45aa7e7474299b5e239499698e7300818e8fa17454/nanonets_extractor-0.1.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-23 11:17:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nanonets",
"github_project": "document-extractor",
"github_not_found": true,
"lcname": "nanonets-extractor"
}