llm-data-converter


- Name: llm-data-converter
- Version: 2.1.2
- Summary: Convert any document, text, or URL into LLM-ready data format with advanced intelligent document processing capabilities powered by pre-trained models
- Upload time: 2025-07-16 13:25:49
- Requires Python: >=3.8
- License: MIT
- Keywords: document-conversion, document-processing, document-understanding, image-processing, intelligent-document-processing, llm, markdown, pdf

# LLM Data Converter

[![PyPI version](https://badge.fury.io/py/llm-data-converter.svg)](https://badge.fury.io/py/llm-data-converter)
[![Downloads](https://pepy.tech/badge/llm-data-converter)](https://pepy.tech/project/llm-data-converter)
[![Python versions](https://img.shields.io/pypi/pyversions/llm-data-converter)](https://pypi.org/project/llm-data-converter/)

Convert documents, plain text, and URLs into LLM-ready formats such as Markdown, using intelligent document processing powered by pre-trained models.

## Installation

```bash
pip install llm-data-converter
```

**Requirements:**
- Python 3.8 or higher

### System Dependencies for Intelligent Document Processing

For this library to work properly, you may need to install additional system dependencies:

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install -y libgl1 libglib2.0-0 libgomp1
pip install setuptools
```

**macOS:**
```bash
# Usually not needed, but if you encounter OpenGL issues:
brew install mesa
```

**Note:** The package automatically downloads and caches its pre-trained models on first use.

## Quick Start

```python
from llm_converter import FileConverter

# Basic conversion 
converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```

## Features

- **Multiple Input Formats**: PDF, DOCX, TXT, HTML, URLs, Excel files, and more
- **Multiple Output Formats**: Markdown, HTML, JSON, Plain Text
- **LLM Integration**: Seamless integration with LiteLLM and other LLM libraries
- **Local Processing**: Process documents locally without external dependencies
- **Layout Preservation**: Maintain document structure and formatting
- **Intelligent Document Processing**: Document understanding and conversion powered by pre-trained models:
  - **Layout Detection**: Models that identify document structure
  - **Text Recognition**: High-accuracy text extraction with confidence scoring
  - **Table Structure**: Table detection and conversion to markdown format
  - **Automatic Model Download**: Models are downloaded and cached automatically


## Usage Examples

### Convert PDF to Markdown

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("document.pdf").to_markdown()
print(result)
```


### Convert Image to HTML

```python
from llm_converter import FileConverter

converter = FileConverter()
result = converter.convert("sample.png").to_html()
print(result)
```

### Chain with LLM

```python
from llm_converter import FileConverter
from litellm import completion

converter = FileConverter()
document_content = converter.convert("report.pdf").to_markdown()

# Use with any LLM
response = completion(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that analyzes documents."},
        {"role": "user", "content": f"Summarize this document:\n\n{document_content}"}
    ]
)

print(response.choices[0].message.content)
```
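Long documents can exceed a model's context window, so it may help to split the converted markdown before sending it to the LLM. The following is a minimal sketch, not part of this library: `chunk_markdown` and its `max_chars` parameter are hypothetical names, and a character budget is only a rough stand-in for a token limit.

```python
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 8000) -> List[str]:
    """Split markdown into chunks of at most max_chars, breaking on blank lines.

    Hypothetical helper: character counts only approximate token counts.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for paragraph in markdown.split("\n\n"):
        # Hard-split a paragraph that is larger than the budget on its own.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        # Grow the current chunk while it still fits, otherwise start a new one.
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph

    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `completion(...)` in turn, for example to summarize sections separately before combining the summaries.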

## Supported Formats

### Input Formats
- **Documents**: PDF, DOCX, TXT
- **Web**: URLs, HTML files
- **Data**: Excel (XLSX, XLS), CSV
- **Images**: PNG, JPG, JPEG 

### Output Formats
- **Markdown**: Clean, structured markdown with proper table formatting
- **HTML**: Formatted HTML with styling
- **JSON**: Structured JSON data
- **Plain Text**: Simple text extraction
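
When processing a folder of mixed files, it can be useful to filter inputs before calling the converter. The sketch below hard-codes the extensions from the Input Formats list above; `SUPPORTED_EXTENSIONS` and `is_supported_input` are hypothetical names, not part of the library, and the `.html` extension for "HTML files" is an assumption. The `llm-converter --list-formats` command remains the authoritative source.

```python
from pathlib import Path

# Snapshot of the extensions listed under "Input Formats"; not read from the library.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".txt",      # documents
    ".html",                      # web (assumed extension for HTML files)
    ".xlsx", ".xls", ".csv",      # data
    ".png", ".jpg", ".jpeg",      # images
}


def is_supported_input(source: str) -> bool:
    """Return True if source is an http(s) URL or has a listed file extension."""
    if source.lower().startswith(("http://", "https://")):
        return True
    return Path(source).suffix.lower() in SUPPORTED_EXTENSIONS
```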


## CLI Usage

The `llm-converter` command-line tool provides easy access to all conversion features:

### Basic Usage

```bash
# Convert a PDF to markdown (default)
llm-converter document.pdf

# Convert to different output formats
llm-converter document.pdf --output html
llm-converter document.pdf --output json
llm-converter document.pdf --output text
```

### Advanced Options

```bash
# Save output to file
llm-converter document.pdf --output-file output.md

# Convert an image (markdown output by default)
llm-converter image.png

# Convert multiple files at once
llm-converter file1.pdf file2.docx file3.xlsx --output markdown
```

### List Supported Formats

```bash
# See all supported input formats
llm-converter --list-formats
```

### Examples

```bash
# Convert PDF to markdown
llm-converter scanned_document.pdf --output markdown

# Convert image to HTML with layout preservation
llm-converter screenshot.png --output html

# Convert multiple documents to JSON
llm-converter report.pdf presentation.pptx data.xlsx --output json --output-file combined.json

# Convert URL content to markdown
llm-converter https://blog.example.com --output markdown --output-file blog_content.md
```
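
To drive the CLI from a Python script, the argument list can be assembled programmatically. This is a sketch under the assumption that only the flags shown above (`--output` and `--output-file`) are used; `build_command` and `OUTPUT_FORMATS` are hypothetical names introduced for illustration.

```python
from typing import List, Optional, Sequence

# Output format names as documented for the --output flag.
OUTPUT_FORMATS = ("markdown", "html", "json", "text")


def build_command(
    inputs: Sequence[str],
    output: str = "markdown",
    output_file: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the llm-converter CLI without running it."""
    if not inputs:
        raise ValueError("at least one input file or URL is required")
    if output not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output!r}")

    command = ["llm-converter", *inputs, "--output", output]
    if output_file is not None:
        command += ["--output-file", output_file]
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.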

### Output Formats

- **markdown** (default): Clean, structured markdown
- **html**: Formatted HTML with styling
- **json**: Structured JSON data
- **text**: Plain text extraction


## API Reference

### FileConverter

Main class for converting documents to LLM-ready formats.

#### Methods

- `convert(file_path: str) -> ConversionResult`: Convert a file to internal format
- `convert_url(url: str) -> ConversionResult`: Convert the contents of a web page at a URL to internal format
- `convert_text(text: str) -> ConversionResult`: Convert plain text to internal format
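
Since the three methods take different kinds of input, a small dispatcher can pick the right one. The sketch below assumes only the method names listed above; `convert_any` is a hypothetical helper, and the rule it applies (URL prefix first, then existing file, otherwise plain text) is one possible heuristic rather than library behavior.

```python
from pathlib import Path
from typing import Any


def _is_existing_file(source: str) -> bool:
    """Return True if source names an existing file; long text is never a path."""
    try:
        return Path(source).is_file()
    except (OSError, ValueError):
        # Raised for over-long names or embedded null bytes in arbitrary text.
        return False


def convert_any(converter: Any, source: str) -> Any:
    """Route source to convert_url, convert, or convert_text on the converter."""
    if source.lower().startswith(("http://", "https://")):
        return converter.convert_url(source)
    if _is_existing_file(source):
        return converter.convert(source)
    return converter.convert_text(source)
```

With the library installed, this would be called as `convert_any(FileConverter(), "report.pdf").to_markdown()`.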

### ConversionResult

Result object with methods to export to different formats.

#### Methods

- `to_markdown() -> str`: Export as markdown
- `to_html() -> str`: Export as HTML
- `to_json() -> dict`: Export as JSON
- `to_text() -> str`: Export as plain text
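
Because `to_json()` returns a `dict` while the other exporters return `str`, code that writes results to disk needs to serialize the JSON case. A minimal sketch, assuming only the four methods listed above; `export` is a hypothetical helper name.

```python
import json
from typing import Any

# Maps a format name to the ConversionResult method documented above.
_EXPORT_METHODS = {
    "markdown": "to_markdown",
    "html": "to_html",
    "json": "to_json",
    "text": "to_text",
}


def export(result: Any, fmt: str = "markdown") -> str:
    """Return the result in the requested format, always as a string."""
    try:
        method_name = _EXPORT_METHODS[fmt]
    except KeyError:
        raise ValueError(f"unknown format: {fmt!r}") from None

    payload = getattr(result, method_name)()
    if fmt == "json":
        # to_json() yields a dict, so serialize it for writing to a file.
        return json.dumps(payload, indent=2, ensure_ascii=False)
    return payload
```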


## License

MIT License. See the LICENSE file for details.
            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "llm-data-converter",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "document-conversion, document-processing, document-understanding, image-processing, intelligent-document-processing, llm, markdown, pdf",
    "author": null,
    "author_email": "Nanonets <team@nanonets.com>",
    "download_url": "https://files.pythonhosted.org/packages/76/75/a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150/llm_data_converter-2.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert any document, text, or URL into LLM-ready data format with advanced intelligent document processing capabilities powered by pre-trained models",
    "version": "2.1.2",
    "project_urls": {
        "Documentation": "https://github.com/nanonets/llm-data-converter#readme",
        "Homepage": "https://github.com/nanonets/llm-data-converter",
        "Issues": "https://github.com/nanonets/llm-data-converter/issues",
        "Repository": "https://github.com/nanonets/llm-data-converter"
    },
    "split_keywords": [
        "document-conversion",
        " document-processing",
        " document-understanding",
        " image-processing",
        " intelligent-document-processing",
        " llm",
        " markdown",
        " pdf"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "400a5cd04f59abafbf7b7800cc0330e20be99797769446cae6961af15a118456",
                "md5": "17a622121f4d3b2805cf57cab508dc88",
                "sha256": "a91f2ac2e4e8a6438343b0fccb030a4baac12f7b5432ba6c1da0e01420888eec"
            },
            "downloads": -1,
            "filename": "llm_data_converter-2.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "17a622121f4d3b2805cf57cab508dc88",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 47440,
            "upload_time": "2025-07-16T13:25:47",
            "upload_time_iso_8601": "2025-07-16T13:25:47.340741Z",
            "url": "https://files.pythonhosted.org/packages/40/0a/5cd04f59abafbf7b7800cc0330e20be99797769446cae6961af15a118456/llm_data_converter-2.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7675a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150",
                "md5": "c3582fc244473d73c929b0eb07737c6e",
                "sha256": "e208153aaa47d7b36f49f9480bb08939f8b47e9264a582ce339a379bb4b4877e"
            },
            "downloads": -1,
            "filename": "llm_data_converter-2.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "c3582fc244473d73c929b0eb07737c6e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 35947,
            "upload_time": "2025-07-16T13:25:49",
            "upload_time_iso_8601": "2025-07-16T13:25:49.286808Z",
            "url": "https://files.pythonhosted.org/packages/76/75/a186b41c97eedf7ea0ca9d831cdfbd2003b605e5af3b3e05beb96ab00150/llm_data_converter-2.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-16 13:25:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "nanonets",
    "github_project": "llm-data-converter#readme",
    "github_not_found": true,
    "lcname": "llm-data-converter"
}