Name | llm-data-converter |
Version | 2.2.0 |
home_page | None |
Summary | Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract |
upload_time | 2025-07-25 13:32:07 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | MIT |
keywords |
ai-training-data
batch-document-processing
docling-alternative
document-ai
document-conversion
document-processing
document-to-markdown
document-understanding
excel-to-markdown
html-to-markdown
image-processing
intelligent-document-processing
layout-detection
llm
llm-ready-data
local-document-processing
markdown
marker-alternative
markitdown-alternative
mineru-alternative
ocr
offline-document-converter
paddleocr-alternative
pdf
pdf-to-markdown
powerpoint-to-markdown
rag
structured-data-extraction
table-extraction
tesseract-alternative
text-extraction
unstructured-alternative
word-to-markdown
|
requirements |
No requirements were recorded.
|
# LLM Data Converter
> **Try Cloud Mode for Free!**
> Convert documents instantly with our cloud API - no setup required.
> For unlimited processing, [get your free API key](https://app.nanonets.com/#/keys).
Transform any document, image, or URL into LLM-ready formats (Markdown, JSON, CSV, HTML) with intelligent content extraction and advanced OCR.
## Key Features
- **Cloud Processing (Default)**: Instant conversion with Nanonets API - no local setup needed
- **Local Processing**: CPU/GPU options for complete privacy and control
- **Universal Input**: PDFs, Word docs, Excel, PowerPoint, images, URLs, and raw text
- **Smart Output**: Markdown, JSON, CSV, HTML, and plain text formats
- **LLM-Optimized**: Clean, structured output perfect for AI processing
- **Intelligent Extraction**: Extract specific fields or structured data using AI
- **Advanced OCR**: Multiple OCR engines with automatic fallback
- **Table Processing**: Accurate table extraction and formatting
- **Image Handling**: Extract text from images and visual content
- **URL Processing**: Direct conversion from web pages
## Installation
```bash
pip install llm-data-converter
```
## Quick Start
### Basic Usage (Cloud Mode - Default)
```python
from llm_converter import FileConverter
# Default cloud mode - no setup required
converter = FileConverter()
# Convert any document
result = converter.convert("document.pdf")
# Get different output formats
markdown = result.to_markdown()
json_data = result.to_json()
html = result.to_html()
csv_tables = result.to_csv()
# Extract specific fields
extracted_fields = result.to_json(specified_fields=[
    "title", "author", "date", "summary", "key_points"
])

# Extract using JSON schema
schema = {
    "title": "string",
    "author": "string",
    "date": "string",
    "summary": "string",
    "key_points": ["string"],
    "metadata": {
        "page_count": "number",
        "language": "string"
    }
}
structured_data = result.to_json(json_schema=schema)
```
### With API Key (Unlimited Access)
```python
from llm_converter import FileConverter

# Get your free API key from https://app.nanonets.com/#/keys
converter = FileConverter(api_key="your_api_key_here")
result = converter.convert("document.pdf")
```
### Local Processing
```python
from llm_converter import FileConverter

# Force local CPU processing
converter = FileConverter(cpu_preference=True)

# Force local GPU processing (requires CUDA)
converter = FileConverter(gpu_preference=True)
```
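If you want one script to run on both GPU and CPU machines, the choice between the two flags can be made at runtime. The sketch below is only an illustration: `local_mode_kwargs` is a hypothetical helper, and looking for the `nvidia-smi` executable is a rough heuristic rather than a real CUDA check.

```python
import shutil


def local_mode_kwargs(has_gpu=None):
    """Return FileConverter keyword arguments for local processing.

    has_gpu: pass True or False to decide explicitly. None means "guess"
    by looking for the nvidia-smi executable (a rough heuristic, not a
    real CUDA check).
    """
    if has_gpu is None:
        has_gpu = shutil.which("nvidia-smi") is not None
    return {"gpu_preference": True} if has_gpu else {"cpu_preference": True}


# Usage (requires llm-data-converter to be installed):
# from llm_converter import FileConverter
# converter = FileConverter(**local_mode_kwargs())
```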
## Output Formats
- **Markdown**: Clean, LLM-friendly format with preserved structure
- **JSON**: Structured data with metadata and intelligent parsing
- **HTML**: Formatted output with styling and layout
- **CSV**: Extract tables and data in spreadsheet format
- **Text**: Plain text with smart formatting
## Examples
### Convert Multiple File Types
```python
from llm_converter import FileConverter
converter = FileConverter()
# PDF document
pdf_result = converter.convert("report.pdf")
print(pdf_result.to_markdown())
# Word document
docx_result = converter.convert("document.docx")
print(docx_result.to_json())
# Excel spreadsheet
excel_result = converter.convert("data.xlsx")
print(excel_result.to_csv())
# PowerPoint presentation
pptx_result = converter.convert("slides.pptx")
print(pptx_result.to_html())
# Image with text
image_result = converter.convert("screenshot.png")
print(image_result.to_text())
# Web page
url_result = converter.convert("https://example.com")
print(url_result.to_markdown())
```
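For a whole folder of files, the same `convert(...)` / `to_markdown()` calls can be wrapped in a loop. This is a minimal sketch, not part of the library: `convert_folder` is a hypothetical helper, and it only assumes that the object passed in has a `convert(path)` method whose result has `to_markdown()`, as shown above.

```python
from pathlib import Path


def convert_folder(converter, folder, out_dir, pattern="*.pdf"):
    """Convert every file in `folder` matching `pattern` to Markdown.

    `converter` is any object with a convert(path) method whose result
    has to_markdown(), for example a FileConverter instance.
    Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(folder).glob(pattern)):
        result = converter.convert(str(source))
        target = out_dir / (source.stem + ".md")
        target.write_text(result.to_markdown(), encoding="utf-8")
        written.append(target)
    return written


# Usage (requires llm-data-converter to be installed):
# from llm_converter import FileConverter
# convert_folder(FileConverter(), "reports", "reports_md")
```

Passing the converter in as an argument keeps the helper independent of cloud or local mode.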
### Extract Tables to CSV
```python
from llm_converter import FileConverter

converter = FileConverter()

# Extract all tables from a document
result = converter.convert("financial_report.pdf")
csv_data = result.to_csv(include_all_tables=True)
print(csv_data)
```
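Since `to_csv()` returns a string, it can be handed to Python's standard `csv` module for further processing. The sketch below uses a made-up sample string in place of real output and assumes a single table with a header row; how several tables are separated in one string is not documented here, so that case is not handled.

```python
import csv
import io


def csv_to_rows(csv_text):
    """Parse a CSV string into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Stand-in for result.to_csv(); real output depends on the document.
sample = "item,amount\nWidgets,120.50\nShipping,9.99\n"
rows = csv_to_rows(sample)
total = sum(float(row["amount"]) for row in rows)
print(rows[0]["item"], round(total, 2))
```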
### Enhanced JSON Conversion
JSON conversion uses an LLM-based document-understanding step when Ollama is available, and the result's `format` key reports which path produced the output:
```python
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("document.pdf")
# Enhanced JSON with Ollama (when available)
json_data = result.to_json()
print(json_data["format"]) # "ollama_structured_json" or "structured_json"
# The enhanced conversion provides:
# - Better document structure understanding
# - Intelligent table parsing
# - Automatic metadata extraction
# - Key information identification
# - Proper data type handling
```
**Requirements for enhanced JSON (if using cpu_preference=True):**
- Install: `pip install 'llm-data-converter[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
*If Ollama is not available, the library automatically falls back to the standard JSON parser.*
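To see whether the fallback happened, you can inspect the `format` key mentioned above. A minimal sketch, assuming the only two values are the ones quoted in the example (`"ollama_structured_json"` and `"structured_json"`); `conversion_path` is a hypothetical helper name.

```python
def conversion_path(json_data):
    """Describe which JSON path produced `json_data`, based on its "format" key."""
    fmt = json_data.get("format")
    if fmt == "ollama_structured_json":
        return "ollama"
    if fmt == "structured_json":
        return "standard"
    # Any other value is not described in this README.
    return "unknown"


# Usage:
# if conversion_path(result.to_json()) != "ollama":
#     print("Ollama was not used; check that `ollama serve` is running")
```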
### Extract Specific Fields & Structured Data
```python
from llm_converter import FileConverter

converter = FileConverter()

# Extract specific fields from any document
result = converter.convert("invoice.pdf")

# Method 1: Extract specific fields
extracted = result.to_json(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date"
])

# Method 2: Extract using JSON schema
schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "line_items": [{
        "description": "string",
        "amount": "number"
    }]
}

structured = result.to_json(json_schema=schema)
```
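Because extraction is model-driven, it can be worth checking that the returned data has the shape the schema asked for before using it. The sketch below is an assumption-laden illustration: `conforms` is a hypothetical helper, and it interprets the schema notation used above as `"string"` / `"number"` leaves, nested dicts, and one-element lists describing every item.

```python
def conforms(value, schema):
    """Return True if `value` matches the README-style type schema.

    "string" and "number" are leaf types; a dict schema requires a dict
    containing every schema key; a one-element list schema describes
    every item of a list.
    """
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub)
            for key, sub in schema.items()
        )
    if isinstance(schema, list) and len(schema) == 1:
        return isinstance(value, list) and all(
            conforms(item, schema[0]) for item in value
        )
    return False
```

Extra keys in the data are allowed; only missing keys and wrong types make the check fail.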
**How it works:**
- Automatically uses cloud API when available
- Falls back to local Ollama for privacy-focused processing
- Same interface works for both cloud and local modes
**Cloud Mode Usage Examples:**
```python
from llm_converter import FileConverter
# Default cloud mode (rate-limited without API key)
converter = FileConverter()
# With API key for unlimited access
converter = FileConverter(api_key="your_api_key_here")

# Extract specific fields from invoice
result = converter.convert("invoice.pdf")

# Extract key invoice information
invoice_fields = result.to_json(specified_fields=[
    "invoice_number",
    "total_amount",
    "vendor_name",
    "due_date",
    "items_count"
])

print("Extracted Invoice Fields:")
print(invoice_fields)
# Output: {"extracted_fields": {"invoice_number": "INV-001", ...}, "format": "specified_fields"}

# Extract structured data using schema
invoice_schema = {
    "invoice_number": "string",
    "total_amount": "number",
    "vendor_name": "string",
    "billing_address": {
        "street": "string",
        "city": "string",
        "zip_code": "string"
    },
    "line_items": [{
        "description": "string",
        "quantity": "number",
        "unit_price": "number",
        "total": "number"
    }],
    "taxes": {
        "tax_rate": "number",
        "tax_amount": "number"
    }
}

structured_invoice = result.to_json(json_schema=invoice_schema)
print("Structured Invoice Data:")
print(structured_invoice)
# Output: {"structured_data": {...}, "schema": {...}, "format": "structured_json"}

# Extract from different document types
receipt = converter.convert("receipt.jpg")
receipt_data = receipt.to_json(specified_fields=[
    "merchant_name", "total_amount", "date", "payment_method"
])

contract = converter.convert("contract.pdf")
contract_schema = {
    "parties": [{
        "name": "string",
        "role": "string"
    }],
    "contract_value": "number",
    "start_date": "string",
    "end_date": "string",
    "key_terms": ["string"]
}
contract_data = contract.to_json(json_schema=contract_schema)
```
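The two output comments in the example show the payload wrapped under different keys: `extracted_fields` for field extraction and `structured_data` for schema extraction. A small helper can hide that difference from downstream code. This is a sketch based only on those two comments; `unwrap` is a hypothetical name and other envelope shapes are not handled.

```python
def unwrap(result_json):
    """Return the payload from a to_json() result, whichever key holds it."""
    for key in ("extracted_fields", "structured_data"):
        if key in result_json:
            return result_json[key]
    # No known envelope key: hand back the dict unchanged.
    return result_json


# Usage:
# invoice = unwrap(result.to_json(specified_fields=["invoice_number"]))
```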
**Local extraction requirements (if using cpu_preference=True):**
- Install ollama package: `pip install 'llm-data-converter[local-llm]'`
- [Install Ollama](https://ollama.ai/) and run: `ollama serve`
- Pull a model: `ollama pull llama3.2`
### Chain with LLM
```python
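from llm_converter import FileConverter

converter = FileConverter()

# Convert the document to Markdown for use in a prompt
document_text = converter.convert("research_paper.pdf").to_markdown()

# Pass it to any LLM; `your_llm_client` is a placeholder for your own client object
response = your_llm_client.chat(
    messages=[{
        "role": "user",
        "content": f"Summarize this research paper:\n\n{document_text}"
    }]
)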
```
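Long documents may not fit in a single prompt, so the Markdown often has to be split before it is sent to a model. The sketch below splits at Markdown headings and is independent of this library; `chunk_markdown` and the character budget are illustrative assumptions (real limits are measured in tokens and depend on the model).

```python
import re


def chunk_markdown(text, max_chars=4000):
    """Split Markdown into chunks of at most `max_chars`, breaking at headings.

    Sections longer than max_chars are cut at fixed offsets as a last resort.
    """
    # Split before each line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks, current = [], ""
    for section in sections:
        if not section:
            continue
        # Flush the running chunk if this section would overflow it.
        if current and len(current) + len(section) > max_chars:
            chunks.append(current)
            current = ""
        # Oversized section: `current` is always empty here, so hard-cut it.
        while len(section) > max_chars:
            chunks.append(section[:max_chars])
            section = section[max_chars:]
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks back together reproduces the input exactly, so no text is lost by the split.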
## Command Line Interface
```bash
# Basic conversion (cloud mode default)
llm-converter document.pdf
# With API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY
# Local processing modes
llm-converter document.pdf --cpu-mode
llm-converter document.pdf --gpu-mode
# Different output formats
llm-converter document.pdf --output json
llm-converter document.pdf --output html
llm-converter document.pdf --output csv
# Extract specific fields
llm-converter invoice.pdf --output json --extract-fields invoice_number total_amount
# Extract with JSON schema
llm-converter document.pdf --output json --json-schema schema.json
# Multiple files
llm-converter *.pdf --output markdown
# Save to file
llm-converter document.pdf --output-file result.md
# Comprehensive field extraction examples
llm-converter invoice.pdf --output json --extract-fields invoice_number vendor_name total_amount due_date line_items
# Extract from different document types with specific fields
llm-converter receipt.jpg --output json --extract-fields merchant_name total_amount date payment_method
llm-converter contract.pdf --output json --extract-fields parties contract_value start_date end_date
# Using JSON schema files for structured extraction
llm-converter invoice.pdf --output json --json-schema invoice_schema.json
llm-converter contract.pdf --output json --json-schema contract_schema.json
# Combine with API key for unlimited access
llm-converter document.pdf --api-key YOUR_API_KEY --output json --extract-fields title author date summary
# Force local processing with field extraction (requires Ollama)
llm-converter document.pdf --cpu-mode --output json --extract-fields key_points conclusions recommendations
```
**Example schema.json file:**
```json
{
  "invoice_number": "string",
  "total_amount": "number",
  "vendor_name": "string",
  "billing_address": {
    "street": "string",
    "city": "string",
    "zip_code": "string"
  },
  "line_items": [{
    "description": "string",
    "quantity": "number",
    "unit_price": "number"
  }]
}
```
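When the schema lives in Python code, the schema file and the command line can be generated together. The sketch below uses only the flags shown in the CLI examples above (`--output json`, `--json-schema`); `build_command` and the temporary file location are hypothetical choices, and the `subprocess` call is left commented out because it needs the `llm-converter` executable on your PATH.

```python
import json
import shlex
import tempfile
from pathlib import Path


def build_command(document, schema, schema_path):
    """Write `schema` to `schema_path` and return the CLI argument list."""
    Path(schema_path).write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return [
        "llm-converter", str(document),
        "--output", "json",
        "--json-schema", str(schema_path),
    ]


schema = {"invoice_number": "string", "total_amount": "number"}
schema_file = Path(tempfile.gettempdir()) / "invoice_schema.json"
command = build_command("invoice.pdf", schema, schema_file)
print(shlex.join(command))
# To run it (needs the llm-converter CLI on PATH):
# import subprocess; subprocess.run(command, check=True)
```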
## Library API Reference
### FileConverter
```python
FileConverter(
    preserve_layout: bool = True,   # Preserve document structure
    include_images: bool = True,    # Include image content
    ocr_enabled: bool = True,       # Enable OCR processing
    api_key: str = None,            # API key for unlimited cloud access
    model: str = None,              # Model for cloud processing ("gemini", "openapi")
    cpu_preference: bool = False,   # Force local CPU processing
    gpu_preference: bool = False    # Force local GPU processing
)
```
### ConversionResult Methods
```python
result.to_markdown() -> str # Clean markdown output
result.to_json( # Structured JSON
    specified_fields: List[str] = None,  # Extract specific fields
    json_schema: Dict = None             # Extract with schema
) -> Dict
result.to_html() -> str # Formatted HTML
result.to_csv() -> str # CSV format for tables
result.to_text() -> str # Plain text
```
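If the output format is chosen at runtime (for example from a config value), the methods listed above can be dispatched by name. A minimal sketch: `FORMAT_METHODS` and `render` are hypothetical names, and the mapping simply mirrors the five methods in this reference.

```python
FORMAT_METHODS = {
    "markdown": "to_markdown",
    "json": "to_json",
    "html": "to_html",
    "csv": "to_csv",
    "text": "to_text",
}


def render(result, fmt):
    """Call the ConversionResult method that matches `fmt`."""
    try:
        method_name = FORMAT_METHODS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return getattr(result, method_name)()


# Usage:
# output = render(converter.convert("document.pdf"), "html")
```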
## Advanced Configuration
### Custom OCR Settings
```python
converter = FileConverter(
    cpu_preference=True,    # Use local processing
    ocr_enabled=True,       # Enable OCR
    preserve_layout=True,   # Maintain structure
    include_images=True     # Process images
)
```
### Environment Variables
```bash
export NANONETS_API_KEY="your_api_key"
# Now all conversions use your API key automatically
```
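If you would rather pass the key explicitly than rely on the automatic lookup, you can read the same variable yourself and hand it to the `api_key` argument. A minimal sketch; `converter_kwargs` is a hypothetical helper, and an unset or blank variable simply yields no `api_key`, which leaves the converter in its default mode.

```python
import os


def converter_kwargs(environ=os.environ):
    """Build FileConverter keyword arguments from the environment."""
    api_key = environ.get("NANONETS_API_KEY", "").strip()
    return {"api_key": api_key} if api_key else {}


# Usage (requires llm-data-converter to be installed):
# from llm_converter import FileConverter
# converter = FileConverter(**converter_kwargs())
```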
## Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Support
- **Email**: support@nanonets.com
- **Issues**: [GitHub Issues](https://github.com/NanoNets/llm-data-converter/issues)
- **Discussions**: [GitHub Discussions](https://github.com/NanoNets/llm-data-converter/discussions)
---
**Star this repo** if you find it helpful! Your support helps us improve the library.
Raw data
{
"_id": null,
"home_page": null,
"name": "llm-data-converter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "ai-training-data, batch-document-processing, docling-alternative, document-ai, document-conversion, document-processing, document-to-markdown, document-understanding, excel-to-markdown, html-to-markdown, image-processing, intelligent-document-processing, layout-detection, llm, llm-ready-data, local-document-processing, markdown, marker-alternative, markitdown-alternative, mineru-alternative, ocr, offline-document-converter, paddleocr-alternative, pdf, pdf-to-markdown, powerpoint-to-markdown, rag, structured-data-extraction, table-extraction, tesseract-alternative, text-extraction, unstructured-alternative, word-to-markdown",
"author": null,
"author_email": "Nanonets <team@nanonets.com>",
"download_url": "https://files.pythonhosted.org/packages/72/b2/acdd48fd94704ce97490d8718de5fa862be400ea6bed760bdac8507662ca/llm_data_converter-2.2.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Best open-source document to markdown converter for LLM training data. Convert PDF, Word, PowerPoint, Excel, images, URLs to clean markdown, JSON, HTML locally. Alternative to Unstructured, Docling, Marker, MarkItDown, MinerU, PaddleOCR, Tesseract",
"version": "2.2.0",
"project_urls": {
"Documentation": "https://github.com/nanonets/llm-data-converter#readme",
"Homepage": "https://github.com/nanonets/llm-data-converter",
"Issues": "https://github.com/nanonets/llm-data-converter/issues",
"Repository": "https://github.com/nanonets/llm-data-converter"
},
"split_keywords": [
"ai-training-data",
" batch-document-processing",
" docling-alternative",
" document-ai",
" document-conversion",
" document-processing",
" document-to-markdown",
" document-understanding",
" excel-to-markdown",
" html-to-markdown",
" image-processing",
" intelligent-document-processing",
" layout-detection",
" llm",
" llm-ready-data",
" local-document-processing",
" markdown",
" marker-alternative",
" markitdown-alternative",
" mineru-alternative",
" ocr",
" offline-document-converter",
" paddleocr-alternative",
" pdf",
" pdf-to-markdown",
" powerpoint-to-markdown",
" rag",
" structured-data-extraction",
" table-extraction",
" tesseract-alternative",
" text-extraction",
" unstructured-alternative",
" word-to-markdown"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "1e83051a8f73cf3e07b598a3332d58ed7f6046430f6eef288c402a222a6d2489",
"md5": "4c0f6032af473d29c76446386bb6e094",
"sha256": "04e0e425e93cd3fe77cad4a8dfe258989c0c037b6c7104003cb1d0c0432902c1"
},
"downloads": -1,
"filename": "llm_data_converter-2.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "4c0f6032af473d29c76446386bb6e094",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 71846,
"upload_time": "2025-07-25T13:32:05",
"upload_time_iso_8601": "2025-07-25T13:32:05.959453Z",
"url": "https://files.pythonhosted.org/packages/1e/83/051a8f73cf3e07b598a3332d58ed7f6046430f6eef288c402a222a6d2489/llm_data_converter-2.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "72b2acdd48fd94704ce97490d8718de5fa862be400ea6bed760bdac8507662ca",
"md5": "f4380c3f83b744ddedbb5f11c8e24aec",
"sha256": "5b294146e1346911696dcf595284738f82ccdd69d0de89d14b964c3fd83facd0"
},
"downloads": -1,
"filename": "llm_data_converter-2.2.0.tar.gz",
"has_sig": false,
"md5_digest": "f4380c3f83b744ddedbb5f11c8e24aec",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 55853,
"upload_time": "2025-07-25T13:32:07",
"upload_time_iso_8601": "2025-07-25T13:32:07.929437Z",
"url": "https://files.pythonhosted.org/packages/72/b2/acdd48fd94704ce97490d8718de5fa862be400ea6bed760bdac8507662ca/llm_data_converter-2.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-25 13:32:07",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nanonets",
"github_project": "llm-data-converter#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "llm-data-converter"
}