# LLM Data Converter
Convert documents, plain text, and URLs into LLM-ready data (Markdown by default), using intelligent document processing powered by pre-trained models.
## Installation
```bash
pip install llm-data-converter
```
**Requirements:**
- Python 3.8 or higher
### System Dependencies for Intelligent Document Processing
For this library to work properly, you may need to install additional system dependencies:
**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools
```
**macOS:**
```bash
# Usually not needed, but if you encounter OpenGL issues:
brew install mesa
```
**Note:** The package automatically downloads and caches its pre-trained models on first use.
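To see up front whether the Linux shared libraries listed above are visible to the dynamic loader, a small check with the standard-library `ctypes.util.find_library` may help. This is a sketch only: the short names `GL`, `glib-2.0`, and `gomp` are an assumed mapping from the `libgl1`, `libglib2.0-0`, and `libgomp1` packages, and a "missing" result does not by itself prove that conversion will fail.

```python
from ctypes.util import find_library

# Assumed mapping: apt package name -> library name as the loader knows it.
REQUIRED_LIBS = {
    "libgl1": "GL",
    "libglib2.0-0": "glib-2.0",
    "libgomp1": "gomp",
}


def check_system_libraries():
    """Return {apt_package: bool} telling whether each library was located."""
    return {
        package: find_library(name) is not None
        for package, name in REQUIRED_LIBS.items()
    }


for package, found in check_system_libraries().items():
    print(f"{package}: {'found' if found else 'missing'}")
```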
## Quick Start
```python
from llm_converter import FileConverter
# Basic conversion
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```
## Features
- **Multiple Input Formats**: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- **Multiple Output Formats**: Markdown, HTML, JSON, Plain Text
- **LLM Integration**: Seamless integration with LiteLLM and other LLM libraries
- **Local Processing**: Process documents locally without external dependencies
- **Layout Preservation**: Maintain document structure and formatting
- **Intelligent Document Processing**: Advanced document understanding and conversion powered by pre-trained models:
  - **Layout Detection**: Intelligent models for document structure understanding
  - **Text Recognition**: High-accuracy text extraction with confidence scoring
  - **Table Structure**: Intelligent table detection and conversion to markdown format
  - **Automatic Model Download**: Models are automatically downloaded and cached
## Usage Examples
### Convert PDF to Markdown
```python
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```
### Convert Image to HTML
```python
from llm_converter import FileConverter
converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)
```
### Chain with LLM
```python
from llm_converter import FileConverter
from litellm import completion
converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()
# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)
print(response.choices[0].message.content)
```
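Long documents may not fit into a single prompt. The helper below is a hypothetical sketch, not part of `llm_converter`: it splits the Markdown string returned by `to_markdown()` at blank lines into chunks of roughly `max_chars` characters, so each chunk can be sent in its own `completion` call. The function name and the character budget are assumptions, and the split is naive: a single paragraph longer than `max_chars` is kept whole, and a fenced code block containing blank lines may be divided.

```python
def chunk_markdown(markdown, max_chars=8000):
    """Split Markdown at blank lines into chunks of about max_chars characters."""
    chunks, current, size = [], [], 0
    for block in markdown.split("\n\n"):
        # Account for the "\n\n" separator when the chunk already has content.
        added = len(block) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(block)
        current.append(block)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Joining the returned chunks with `"\n\n"` reproduces the original string, so no content is lost by the split.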
## Supported Formats
### Input Formats
- **Documents**: PDF, DOCX, TXT
- **Web**: URLs, HTML files
- **Data**: Excel (XLSX, XLS), CSV
- **Images**: PNG, JPG, JPEG
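
When a folder mixes supported and unsupported files, it can help to filter by extension before converting. The sketch below assumes that the format names listed above map to the usual file suffixes; `SUPPORTED_SUFFIXES` and `find_convertible` are illustrative names, not part of the library.

```python
from pathlib import Path

# Assumed suffixes for the input formats listed above.
SUPPORTED_SUFFIXES = {
    ".pdf", ".docx", ".txt",   # documents
    ".html",                   # web
    ".xlsx", ".xls", ".csv",   # data
    ".png", ".jpg", ".jpeg",   # images
}


def find_convertible(folder):
    """Return a sorted list of files under `folder` with a supported suffix."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The suffix comparison is case-insensitive, so `SCAN.PNG` is picked up as well as `scan.png`.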
### Output Formats
- **Markdown**: Clean, structured markdown with proper table formatting
- **HTML**: Formatted HTML with styling
- **JSON**: Structured JSON data
- **Plain Text**: Simple text extraction
## CLI Usage
The `llm-converter` command-line tool provides easy access to all conversion features:
### Basic Usage
```bash
# Convert a PDF to markdown (default)
llm-converter document.pdf
# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text
```
### Advanced Options
```bash
# Save output to file
llm-converter document.pdf --output-file output.md
# For image input
llm-converter image.png
# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown
```
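
The multi-file invocation above has a rough equivalent in the Python API. The sketch below takes any object with a `convert(path)` method, such as a `FileConverter` instance, and collects Markdown per file. Which exceptions the library raises on a bad input is not documented here, so catching `Exception` is an assumption made to keep one failing file from stopping the batch; `convert_many` itself is a hypothetical helper.

```python
def convert_many(converter, paths):
    """Convert each path to Markdown; return (results, failures) dicts keyed by path."""
    results, failures = {}, {}
    for path in paths:
        key = str(path)
        try:
            results[key] = converter.convert(key).to_markdown()
        except Exception as exc:  # assumption: the exception types are unknown
            failures[key] = repr(exc)
    return results, failures
```

A call such as `convert_many(FileConverter(), ["file1.pdf", "file2.docx"])` would return the converted Markdown alongside a record of any files that could not be processed.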
### List Supported Formats
```bash
# See all supported input formats
llm-converter --list-formats
```
### Examples
```bash
# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown
# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html
# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json
# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md
```
### Output Formats
- **markdown** (default): Clean, structured markdown
- **html**: Formatted HTML with styling
- **json**: Structured JSON data
- **text**: Plain text extraction
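
To drive the CLI from a Python script, the command line can be assembled from the flags shown above and passed to `subprocess.run`. The helper names are hypothetical, and the sketch assumes that the tool writes the converted content to standard output when `--output-file` is not given.

```python
import subprocess

VALID_OUTPUTS = ("markdown", "html", "json", "text")


def build_cli_command(inputs, output="markdown", output_file=None):
    """Build the argument list for one `llm-converter` invocation."""
    if not inputs:
        raise ValueError("at least one input file or URL is required")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"unsupported output format: {output!r}")
    command = ["llm-converter", *inputs, "--output", output]
    if output_file is not None:
        command += ["--output-file", str(output_file)]
    return command


def run_cli(inputs, output="markdown", output_file=None):
    """Run the CLI and return whatever it printed to standard output."""
    completed = subprocess.run(
        build_cli_command(inputs, output, output_file),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.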
## API Reference
### FileConverter
Main class for converting documents to LLM-ready formats.
#### Methods
- `convert(file_path: str) -> ConversionResult`: Convert a file to the internal format
- `convert_url(url: str) -> ConversionResult`: Convert the contents of the web page at `url` to the internal format
- `convert_text(text: str) -> ConversionResult`: Convert plain text to the internal format
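
When the kind of input is not known in advance, a small dispatcher can choose among the three methods. This is a sketch under stated assumptions: `pick_method` and `convert_any` are hypothetical helpers, and the rule used (an `http`/`https` scheme means a URL, an existing path means a file, anything else is treated as raw text) is only one plausible heuristic.

```python
import os
from urllib.parse import urlparse


def pick_method(source):
    """Return the name of the FileConverter method suited to `source`."""
    if urlparse(source).scheme in ("http", "https"):
        return "convert_url"
    if os.path.isfile(source):
        return "convert"
    return "convert_text"


def convert_any(converter, source):
    """Call the matching method on `converter` and return its result."""
    return getattr(converter, pick_method(source))(source)
```

For example, `convert_any(FileConverter(), "https://blog.example.com")` would be routed to `convert_url`, while a path to an existing PDF would be routed to `convert`.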
### ConversionResult
Result object with methods to export to different formats.
#### Methods
- `to_markdown() -> str`: Export as markdown
- `to_html() -> str`: Export as HTML
- `to_json() -> dict`: Export as JSON
- `to_text() -> str`: Export as plain text
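
The four export methods can be combined to write one result in every format. The sketch below relies only on the method names and return types listed above; `export_all`, the file suffixes, and the default `stem` are illustrative choices rather than library features.

```python
import json
from pathlib import Path

# File suffix -> name of the documented ConversionResult method.
EXPORTERS = {
    ".md": "to_markdown",
    ".html": "to_html",
    ".json": "to_json",
    ".txt": "to_text",
}


def export_all(result, out_dir, stem="output"):
    """Write `result` in every documented format and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for suffix, method_name in EXPORTERS.items():
        payload = getattr(result, method_name)()
        # to_json() is documented as returning a dict, so serialize it here.
        if not isinstance(payload, str):
            payload = json.dumps(payload, indent=2, ensure_ascii=False)
        path = out_dir / f"{stem}{suffix}"
        path.write_text(payload, encoding="utf-8")
        written.append(path)
    return written
```

With a real result, `export_all(FileConverter().convert("document.pdf"), "out", stem="document")` would produce `out/document.md`, `out/document.html`, `out/document.json`, and `out/document.txt`.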
## License
MIT License - see LICENSE file for details.
Raw data
{
"_id": null,
"home_page": null,
"name": "llm-data-converter",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "document-conversion, document-processing, document-understanding, image-processing, intelligent-document-processing, llm, markdown, pdf",
"author": null,
"author_email": "Nanonets <team@nanonets.com>",
"download_url": "https://files.pythonhosted.org/packages/76/75/a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150/llm_data_converter-2.1.2.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Convert any document, text, or URL into LLM-ready data format with advanced intelligent document processing capabilities powered by pre-trained models",
"version": "2.1.2",
"project_urls": {
"Documentation": "https://github.com/nanonets/llm-data-converter#readme",
"Homepage": "https://github.com/nanonets/llm-data-converter",
"Issues": "https://github.com/nanonets/llm-data-converter/issues",
"Repository": "https://github.com/nanonets/llm-data-converter"
},
"split_keywords": [
"document-conversion",
" document-processing",
" document-understanding",
" image-processing",
" intelligent-document-processing",
" llm",
" markdown",
" pdf"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "400a5cd04f59abafbf7b7800cc0330e20be99797769446cae6961af15a118456",
"md5": "17a622121f4d3b2805cf57cab508dc88",
"sha256": "a91f2ac2e4e8a6438343b0fccb030a4baac12f7b5432ba6c1da0e01420888eec"
},
"downloads": -1,
"filename": "llm_data_converter-2.1.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "17a622121f4d3b2805cf57cab508dc88",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 47440,
"upload_time": "2025-07-16T13:25:47",
"upload_time_iso_8601": "2025-07-16T13:25:47.340741Z",
"url": "https://files.pythonhosted.org/packages/40/0a/5cd04f59abafbf7b7800cc0330e20be99797769446cae6961af15a118456/llm_data_converter-2.1.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "7675a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150",
"md5": "c3582fc244473d73c929b0eb07737c6e",
"sha256": "e208153aaa47d7b36f49f9480bb08939f8b47e9264a582ce339a379bb4b4877e"
},
"downloads": -1,
"filename": "llm_data_converter-2.1.2.tar.gz",
"has_sig": false,
"md5_digest": "c3582fc244473d73c929b0eb07737c6e",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 35947,
"upload_time": "2025-07-16T13:25:49",
"upload_time_iso_8601": "2025-07-16T13:25:49.286808Z",
"url": "https://files.pythonhosted.org/packages/76/75/a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150/llm_data_converter-2.1.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-07-16 13:25:49",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "nanonets",
"github_project": "llm-data-converter#readme",
"github_not_found": true,
"lcname": "llm-data-converter"
}