
# DocStrange
> **☁️ Free Cloud Processing!**
> Extract document data instantly with cloud processing. No setup or API key is needed to get started.

> **🔒 Local Processing Available!**
> Use `cpu` or `gpu` mode for 100% local processing. No data is sent anywhere; everything stays on your machine.

Extract and convert data from documents, images, PDFs, Word files, PowerPoint presentations, or URLs into multiple formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.

## Key Features
- **☁️ Cloud Processing (Default)**: Instant free conversion with cloud API - no local setup needed
- **🔒 Local Processing**: CPU/GPU options for complete privacy - no data sent anywhere
- **Universal Input**: PDFs, Word docs, Excel, PowerPoint, images, URLs, and raw text
- **Smart Output**: Markdown, JSON, CSV, HTML, and plain text formats
- **LLM-Optimized**: Clean, structured output perfect for AI processing
- **Intelligent Extraction**: Extract specific fields or structured data using AI
- **Advanced OCR**: Multiple OCR engines with automatic fallback
- **Table Processing**: Accurate table extraction and formatting
- **Image Handling**: Extract text from images and visual content
- **URL Processing**: Direct conversion from web pages
## Installation
```bash
pip install docstrange
```
## Quick Start
### 1. Convert Document to Markdown
```python
from docstrange import DocumentExtractor
# Initialize extractor (cloud mode by default)
extractor = DocumentExtractor()
# Convert any document to clean markdown
result = extractor.extract("document.pdf")
markdown = result.extract_markdown()
print(markdown)
```
### 2. Extract All Important Information as JSON
```python
from docstrange import DocumentExtractor
# Extract document as structured JSON
extractor = DocumentExtractor()
result = extractor.extract("document.pdf")
# Get all important data as flat JSON
json_data = result.extract_data()
print(json_data)
```
### 3. Extract Specific Fields
```python
from docstrange import DocumentExtractor
# Extract only the fields you need
extractor = DocumentExtractor()
result = extractor.extract("invoice.pdf")
# Specify exactly which fields to extract
fields = result.extract_data(specified_fields=[
    "invoice_number", "total_amount", "vendor_name", "due_date"
])
print(fields)
```
### 4. Extract with Custom JSON Schema
```python
from docstrange import DocumentExtractor
# Extract data conforming to your schema
extractor = DocumentExtractor()
result = extractor.extract("contract.pdf")
# Define your required structure
schema = {
    "contract_number": "string",
    "parties": ["string"],
    "total_value": "number",
    "start_date": "string",
    "terms": ["string"]
}
structured_data = result.extract_data(json_schema=schema)
print(structured_data)
```
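Because the extraction is AI-driven, it can be worth checking that the returned data actually matches the schema before using it downstream. The following is a minimal sketch, not part of DocStrange: `find_schema_mismatches` is a hypothetical helper, and it assumes the shorthand type names `"string"` and `"number"` map to Python `str` and numeric types, with a one-element list describing every item of an array.

```python
from numbers import Number

# Assumed mapping from the shorthand type names to Python type checks.
_TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, Number) and not isinstance(v, bool),
}


def find_schema_mismatches(data, schema, path=""):
    """Return a list of 'path: problem' strings where data does not fit schema."""
    label = path or "<root>"
    problems = []
    if isinstance(schema, dict):
        if not isinstance(data, dict):
            return [f"{label}: expected object"]
        for key, sub_schema in schema.items():
            child = f"{path}.{key}" if path else key
            if key not in data:
                problems.append(f"{child}: missing")
            else:
                problems.extend(find_schema_mismatches(data[key], sub_schema, child))
    elif isinstance(schema, list):
        if not isinstance(data, list):
            return [f"{label}: expected list"]
        if schema:  # A one-element list describes the shape of every item.
            for i, item in enumerate(data):
                problems.extend(find_schema_mismatches(item, schema[0], f"{path}[{i}]"))
    else:
        check = _TYPE_CHECKS.get(schema)
        if check is None:
            problems.append(f"{label}: unknown type name {schema!r}")
        elif not check(data):
            problems.append(f"{label}: expected {schema}")
    return problems


demo_schema = {"contract_number": "string", "parties": ["string"], "total_value": "number"}
demo_data = {"contract_number": "C-42", "parties": ["Acme", 7], "total_value": "n/a"}
print(find_schema_mismatches(demo_data, demo_schema))
# → ['parties[1]: expected string', 'total_value: expected number']
```

An empty list means every key in the schema was present with a value of the expected kind.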
### Local Processing
```python
from docstrange import DocumentExtractor
# Force local CPU processing
extractor = DocumentExtractor(cpu=True)
# Force local GPU processing (requires CUDA)
extractor = DocumentExtractor(gpu=True)
```
## Output Formats
- **Markdown**: Clean, LLM-friendly format with preserved structure
- **JSON**: Structured data with metadata and intelligent parsing
- **HTML**: Formatted output with styling and layout
- **CSV**: Extract tables and data in spreadsheet format
- **Text**: Plain text with smart formatting
## Examples
### Convert Multiple File Types
```python
from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# PDF document
pdf_result = extractor.extract("report.pdf")
print(pdf_result.extract_markdown())
# Word document
docx_result = extractor.extract("document.docx")
print(docx_result.extract_data())
# Excel spreadsheet
excel_result = extractor.extract("data.xlsx")
print(excel_result.extract_csv())
# PowerPoint presentation
pptx_result = extractor.extract("slides.pptx")
print(pptx_result.extract_html())
# Image with text
image_result = extractor.extract("screenshot.png")
print(image_result.extract_text())
# Web page
url_result = extractor.extract("https://example.com")
print(url_result.extract_markdown())
```
### Extract Tables to CSV
```python
from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# Extract all tables from a document
result = extractor.extract("financial_report.pdf")
csv_data = result.extract_csv()
print(csv_data)
```
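Since `extract_csv()` returns a string, the standard library can turn it into rows for further processing. This is a minimal sketch that assumes the CSV text holds a single table with a header row; how multiple tables are laid out in the output is not documented here, so `csv_to_records` (a hypothetical helper) makes no attempt to split them. The sample text stands in for real output.

```python
import csv
import io


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Stand-in for the string returned by result.extract_csv().
sample_csv = "quarter,revenue\nQ1,1200\nQ2,1350\n"
records = csv_to_records(sample_csv)
total = sum(int(row["revenue"]) for row in records)
print(records[0], total)
# → {'quarter': 'Q1', 'revenue': '1200'} 2550
```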
**Requirements for enhanced JSON (if using cpu=True):**
- Install: `pip install 'docstrange[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
*If Ollama is not available, the library automatically falls back to the standard JSON parser.*
### Extract Specific Fields & Structured Data
```python
from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# Extract specific fields from any document
result = extractor.extract("invoice.pdf")
# Method 1: Extract specific fields
extracted = result.extract_data(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date"
])
# Method 2: Extract using JSON schema
schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "line_items": [{
        "description": "string",
        "amount": "number"
    }]
}
structured = result.extract_data(json_schema=schema)
```
**Cloud Mode Usage Examples:**
```python
from docstrange import DocumentExtractor
# Default cloud mode (rate-limited without API key)
extractor = DocumentExtractor()
# With API key for increased rate limit access
extractor = DocumentExtractor(api_key="your_api_key_here")
# Extract specific fields from invoice
result = extractor.extract("invoice.pdf")
# Extract key invoice information
invoice_fields = result.extract_data(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date",
    "items_count"
])
print("Extracted Invoice Fields:")
print(invoice_fields)
# Output: {"extracted_fields": {"invoice_number": "INV-001", ...}, "format": "specified_fields"}
# Extract structured data using schema
invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "billing_address": {
        "street": "string",
        "city": "string",
        "zip_code": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "number",
        "unit_price": "number",
        "total": "number"
    }],
    "taxes": {
        "tax_rate": "number",
        "tax_amount": "number"
    }
}
structured_invoice = result.extract_data(json_schema=invoice_schema)
print("Structured Invoice Data:")
print(structured_invoice)
# Output: {"structured_data": {...}, "schema": {...}, "format": "structured_json"}
# Extract from different document types
receipt = extractor.extract("receipt.jpg")
receipt_data = receipt.extract_data(specified_fields=[
    "merchant_name", "total_amount", "date", "payment_method"
])
contract = extractor.extract("contract.pdf")
contract_schema = {
    "parties": [{
        "name": "string",
        "role": "string"
    }],
    "contract_value": "number",
    "start_date": "string",
    "end_date": "string",
    "key_terms": ["string"]
}
contract_data = contract.extract_data(json_schema=contract_schema)
```
**Local extraction requirements (if using cpu=True):**
- Install ollama package: `pip install 'docstrange[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
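The output comments in the cloud mode example suggest that field results arrive nested under an `extracted_fields` key. A small check for fields the model could not find might look like the sketch below. It assumes that wrapper shape (falling back to the top level if the key is absent) and treats `None`, empty strings, and empty lists as "not found"; `missing_fields` is a hypothetical helper, not a DocStrange function.

```python
def missing_fields(response, requested):
    """List requested field names that are absent or empty in the response."""
    # Assumption: results are nested under "extracted_fields", as in the
    # example output comment; otherwise treat the response as the field dict.
    fields = response.get("extracted_fields", response)
    return [name for name in requested if fields.get(name) in (None, "", [])]


# Stand-in for the dict returned by result.extract_data(specified_fields=[...]).
sample_response = {
    "extracted_fields": {"invoice_number": "INV-001", "total_amount": 0, "due_date": ""},
    "format": "specified_fields",
}
print(missing_fields(sample_response, ["invoice_number", "total_amount", "vendor_name", "due_date"]))
# → ['vendor_name', 'due_date']
```

A legitimate zero, such as a `total_amount` of `0`, is deliberately not reported as missing.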
### Chain with LLM
```python
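from docstrange import DocumentExtractor
extractor = DocumentExtractor()
# Convert the document to markdown for use in an LLM prompt
document_text = extractor.extract("research_paper.pdf").extract_markdown()
# Use with any LLM (`your_llm_client` is a placeholder for your own client)
response = your_llm_client.chat(
    messages=[{
        "role": "user",
        "content": f"Summarize this research paper:\n\n{document_text}"
    }]
)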
```
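Long documents can exceed a model's context window, so it may help to split the markdown before sending it. The sketch below is one simple approach, not part of DocStrange: it breaks on blank lines and uses a character budget as a rough stand-in for tokens. `chunk_markdown` and `max_chars` are illustrative names.

```python
def chunk_markdown(text, max_chars=4000):
    """Split markdown into chunks of at most max_chars, breaking on blank lines."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-split a single oversized paragraph so no chunk exceeds the limit.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Stand-in for the string returned by extract_markdown().
sample_markdown = "# A\n\nalpha\n\n# B\n\nbeta"
print(chunk_markdown(sample_markdown, max_chars=10))
# → ['# A\n\nalpha', '# B\n\nbeta']
```

Each chunk can then be summarized separately and the partial summaries combined in a final prompt.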
## Rate Limits
DocStrange offers **free cloud processing** with rate limits to ensure fair usage:
### Free Tier (No API Key)
- **Rate Limit**: Moderate usage restrictions apply
- **Access**: All output formats (Markdown, JSON, CSV, HTML)
- **Setup**: Zero configuration - works immediately
### Increased Rate Limits (With API Key)
- **Rate Limit**: Higher limits for production use
- **Setup**: Get your free API key from [app.nanonets.com](https://app.nanonets.com/#/keys)
- **Usage**: Pass API key during initialization
```python
from docstrange import DocumentExtractor
# Free tier usage
extractor = DocumentExtractor()
# Increased rate limits with API key
extractor = DocumentExtractor(api_key="your_api_key_here")
```
> **💡 Tip**: Start with the free tier to test functionality, then get an API key for production workloads or higher volume processing.
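
When a request is rejected because of rate limiting, a common pattern is to retry with exponentially growing delays. The sketch below is generic and makes no claim about which exception DocStrange raises when a limit is hit: it catches `Exception` as a placeholder that you should narrow once you know the real error type. `call_with_backoff` is a hypothetical helper.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # Assumption: narrow this to the real rate-limit error.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * 2 ** attempt)  # Waits 1s, 2s, 4s, ... by default.


# Example with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_extract():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


recorded_delays = []
outcome = call_with_backoff(flaky_extract, base_delay=0.5, sleep=recorded_delays.append)
print(outcome, recorded_delays)
# → ok [0.5, 1.0]
```

In real use, the wrapped call would be something like `lambda: extractor.extract("document.pdf")`.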
## Command Line Interface
```bash
# Basic conversion (cloud mode default)
docstrange document.pdf
# With API key for increased rate limit access
docstrange document.pdf --api-key YOUR_API_KEY
# Local processing modes
docstrange document.pdf --cpu-mode
docstrange document.pdf --gpu-mode
# Different output formats
docstrange document.pdf --output json
docstrange document.pdf --output html
docstrange document.pdf --output csv
# Extract specific fields
docstrange invoice.pdf --output json --extract-fields invoice_number total_amount
# Extract with JSON schema
docstrange document.pdf --output json --json-schema schema.json
# Multiple files
docstrange *.pdf --output markdown
# Save to file
docstrange document.pdf --output-file result.md
# Comprehensive field extraction examples
docstrange invoice.pdf --output json --extract-fields invoice_number vendor_name total_amount due_date line_items
# Extract from different document types with specific fields
docstrange receipt.jpg --output json --extract-fields merchant_name total_amount date payment_method
docstrange contract.pdf --output json --extract-fields parties contract_value start_date end_date
# Using JSON schema files for structured extraction
docstrange invoice.pdf --output json --json-schema invoice_schema.json
docstrange contract.pdf --output json --json-schema contract_schema.json
# Combine with API key for increased rate limit access
docstrange document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary
# Force local processing with field extraction (requires Ollama)
docstrange document.pdf --cpu-mode --output json --extract-fields key_points conclusions recommendations
```
**Example schema.json file:**
```json
{
  "invoice_number": "string",
  "total_amount": "number",
  "vendor_name": "string",
  "billing_address": {
    "street": "string",
    "city": "string",
    "zip_code": "string"
  },
  "line_items": [{
    "description": "string",
    "quantity": "number",
    "unit_price": "number"
  }]
}
```
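If the schema already lives in Python code, the schema file can be generated rather than maintained by hand. The sketch below writes a schema dict to disk and builds the matching command line using only flags shown in the CLI examples above. `write_schema` and `build_command` are hypothetical helpers, and the temporary directory is used only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def write_schema(schema, path):
    """Serialize a schema dict to a JSON file and return its path."""
    path = Path(path)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return path


def build_command(document, schema_path):
    """Build the argument list for a schema-driven CLI extraction."""
    return ["docstrange", document, "--output", "json", "--json-schema", str(schema_path)]


invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [{"description": "string", "quantity": "number"}],
}

with tempfile.TemporaryDirectory() as tmp:
    schema_file = write_schema(invoice_schema, Path(tmp) / "invoice_schema.json")
    reloaded = json.loads(schema_file.read_text(encoding="utf-8"))
    command = build_command("invoice.pdf", schema_file)

print(command[:5])
# → ['docstrange', 'invoice.pdf', '--output', 'json', '--json-schema']
```

The resulting list can be passed to `subprocess.run(command, check=True)` while the schema file still exists.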
## API Reference for library
### DocumentExtractor
```python
DocumentExtractor(
    api_key: str = None,   # Free API key for higher rate limits in cloud mode
    model: str = None,     # Model for cloud processing ("gemini", "openapi", "nanonets")
    cpu: bool = False,     # Force local CPU processing
    gpu: bool = False      # Force local GPU processing
)
```
### ConversionResult Methods
```python
result.extract_markdown() -> str # Clean markdown output
result.extract_data(                    # Structured JSON
    specified_fields: List[str] = None, # Extract specific fields
    json_schema: Dict = None            # Extract with schema
) -> Dict
result.extract_html() -> str # Formatted HTML
result.extract_csv() -> str # CSV format for tables
result.extract_text() -> str # Plain text
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Support
- **Email**: support@nanonets.com
- **Issues**: [GitHub Issues](https://github.com/NanoNets/docstrange/issues)
- **Discussions**: [GitHub Discussions](https://github.com/NanoNets/docstrange/discussions)
---
**Star this repo** if you find it helpful! Your support helps us improve the library.