<div align='center'>
# Vision Parse ✨
> Parse PDF documents into beautifully formatted markdown content using state-of-the-art Vision Language Models - all with just a few lines of code!
[Getting Started](#-getting-started) •
[Usage](#-usage) •
[Supported Models](#-supported-models) •
[Parameters](#-customization-parameters) •
[Benchmarks](#-benchmarks)
</div>
## 🎯 Introduction
Vision Parse harnesses the power of Vision Language Models to revolutionize document processing:
- 📝 **Scanned Document Processing**: Intelligently identifies and extracts text, tables, and LaTeX equations from scanned documents into markdown-formatted content with high precision
- 🎨 **Advanced Content Formatting**: Preserves LaTeX equations, hyperlinks, images, and document hierarchy in the markdown output
- 🤖 **Multi-LLM Support**: Seamlessly integrates with multiple Vision LLM providers such as OpenAI, Gemini, and Llama for optimal accuracy and speed
- 📁 **Local Model Hosting**: Supports local model hosting with Ollama for secure, no-cost, private, and offline document processing
## 🚀 Getting Started
### Prerequisites
- 🐍 Python >= 3.9
- 🖥️ Ollama (if you want to use local models)
- 🤖 API key for OpenAI or Google Gemini (if you want to use API-based models)
### Installation
**Install the core package using pip (Recommended):**
```bash
pip install vision-parse
```
**Install the additional dependencies for OpenAI or Gemini:**
```bash
# To install all the additional dependencies
pip install 'vision-parse[all]'
```
**Install the package from source:**
```bash
pip install 'git+https://github.com/iamarunbrahma/vision-parse.git#egg=vision-parse[all]'
```
### Setting up Ollama (Optional)
See the [Ollama Setup Guide](docs/ollama_setup.md) for instructions on how to set up Ollama locally.
> [!IMPORTANT]
> While Ollama provides free local model hosting, vision models served through Ollama can be significantly slower at processing documents and may not produce optimal results on complex PDF layouts. For better accuracy and performance on such documents, consider using API-based models like OpenAI or Gemini.
### Setting up Vision Parse with Docker (Optional)
Check out the [Docker Setup Guide](docs/docker_setup.md) for instructions on how to set up Vision Parse with Docker.
## 📚 Usage
### Basic Example Usage
```python
from vision_parse import VisionParser
# Initialize parser
parser = VisionParser(
    model_name="llama3.2-vision:11b",  # For local models, you don't need to provide the api key
    temperature=0.4,
    top_p=0.5,
    image_mode="url",  # Image mode can be "url", "base64" or None
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=False,  # Set to True for parallel processing
)

# Convert PDF to markdown
pdf_path = "input_document.pdf"  # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)

# Process results
for i, page_content in enumerate(markdown_pages):
    print(f"\n--- Page {i+1} ---\n{page_content}")
```
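Since `convert_pdf` returns one markdown string per page, you can persist the output however you like. Here is a minimal sketch that reuses `markdown_pages` from the example above (the output filename and page separator are arbitrary choices, not part of the library):

```python
from pathlib import Path

# Reusing `markdown_pages` from the example above:
# join all pages into a single markdown file, separated by horizontal rules
output_path = Path("input_document.md")
output_path.write_text("\n\n---\n\n".join(markdown_pages), encoding="utf-8")
print(f"Saved {len(markdown_pages)} pages to {output_path}")
```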
### Customize Ollama configuration for better performance
```python
from vision_parse import VisionParser
custom_prompt = """
Strictly preserve markdown formatting during text extraction from scanned document.
"""
# Initialize parser with Ollama configuration
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    temperature=0.7,
    top_p=0.6,
    num_ctx=4096,
    image_mode="base64",
    custom_prompt=custom_prompt,
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_NUM_PARALLEL": 8,
        "OLLAMA_REQUEST_TIMEOUT": 240,
    },
    enable_concurrency=True,
)
# Convert PDF to markdown
pdf_path = "input_document.pdf" # local path to your pdf file
markdown_pages = parser.convert_pdf(pdf_path)
```
> [!TIP]
> Please refer to [FAQs](docs/faq.md) for more details on how to improve the performance of locally hosted vision models.
### API Models Usage (OpenAI, Azure OpenAI, Gemini, DeepSeek)
```python
from vision_parse import VisionParser
# Initialize parser with OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    api_key="your-openai-api-key",  # Get the OpenAI API key from https://platform.openai.com/api-keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with Azure OpenAI model
parser = VisionParser(
    model_name="gpt-4o",
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
    openai_config={
        "AZURE_ENDPOINT_URL": "https://****.openai.azure.com/",  # replace with your azure endpoint url
        "AZURE_DEPLOYMENT_NAME": "*******",  # replace with azure deployment name, if needed
        "AZURE_OPENAI_API_KEY": "***********",  # replace with your azure openai api key
        "AZURE_OPENAI_API_VERSION": "2024-08-01-preview",  # replace with latest azure openai api version
    },
)

# Initialize parser with Google Gemini model
parser = VisionParser(
    model_name="gemini-1.5-flash",
    api_key="your-gemini-api-key",  # Get the Gemini API key from https://aistudio.google.com/app/apikey
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)

# Initialize parser with DeepSeek model
parser = VisionParser(
    model_name="deepseek-chat",
    api_key="your-deepseek-api-key",  # Get the DeepSeek API key from https://platform.deepseek.com/api_keys
    temperature=0.7,
    top_p=0.4,
    image_mode="url",
    detailed_extraction=False,  # Set to True for more detailed extraction
    enable_concurrency=True,
)
```
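Whichever provider you initialize, conversion works the same way as in the local example. A short sketch, assuming `parser` is any one of the instances created above (the per-page output filenames are an arbitrary choice):

```python
from pathlib import Path

# `parser` is any of the VisionParser instances initialized above
markdown_pages = parser.convert_pdf("input_document.pdf")

# Write each page to its own markdown file
for i, page_content in enumerate(markdown_pages):
    Path(f"page_{i + 1}.md").write_text(page_content, encoding="utf-8")
```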
## ✅ Supported Models
This package supports the following Vision LLM models:
| **Model Name** | **Provider Name** |
|:------------:|:----------:|
| gpt-4o | OpenAI |
| gpt-4o-mini | OpenAI |
| gemini-1.5-flash | Google |
| gemini-2.0-flash-exp | Google |
| gemini-1.5-pro | Google |
| llava:13b | Ollama |
| llava:34b | Ollama |
| llama3.2-vision:11b | Ollama |
| llama3.2-vision:90b | Ollama |
| deepseek-chat | DeepSeek |
## 🔧 Customization Parameters
Vision Parse offers several customization parameters to enhance document processing:
| **Parameter** | **Description** | **Value Type** |
|:---------:|:-----------:|:-------------:|
| model_name | Name of the Vision LLM model to use | str |
| custom_prompt | Custom prompt for the model, appended as a suffix to the default prompt | str |
| ollama_config | Custom configuration for Ollama client initialization | dict |
| openai_config | Custom configuration for OpenAI, Azure OpenAI, or DeepSeek client initialization | dict |
| gemini_config | Custom configuration for Gemini client initialization | dict |
| image_mode | Output format for extracted images: an image URL in the markdown content, a base64-encoded image, or None | str |
| detailed_extraction | Enable advanced content extraction for complex information such as LaTeX equations, tables, and images | bool |
| enable_concurrency | Enable parallel processing of multiple PDF pages in a single request | bool |
> [!TIP]
> For more details on the custom model configurations (`openai_config`, `gemini_config`, and `ollama_config`), please refer to [Model Configuration](docs/model_config.md).
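For instance, if `ollama_config` also accepts the standard `OLLAMA_HOST` setting (an assumption — verify the supported keys in [Model Configuration](docs/model_config.md)), you could point Vision Parse at a remote Ollama server:

```python
from vision_parse import VisionParser

# Assumption: "OLLAMA_HOST" is a supported key in ollama_config;
# check docs/model_config.md for the keys the library actually accepts.
parser = VisionParser(
    model_name="llama3.2-vision:11b",
    image_mode="base64",
    detailed_extraction=True,
    ollama_config={
        "OLLAMA_HOST": "http://192.168.1.20:11434",  # hypothetical remote Ollama server address
        "OLLAMA_REQUEST_TIMEOUT": 240,
    },
)
```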
## 📊 Benchmarks
I benchmarked Vision Parse against MarkItDown and Nougat on a curated dataset of 100 diverse machine learning papers from arXiv. Since no other ground truth is available for this task, I used the Marker library to generate the ground-truth markdown-formatted data.
### Results
| Parser | Accuracy Score |
|:-------:|:---------------:|
| Vision Parse | 92% |
| MarkItDown | 67% |
| Nougat | 79% |
> [!NOTE]
> I used the gpt-4o model for Vision Parse to extract markdown content from the PDF documents, with the model parameter settings defined in the `scoring.py` script. The results above may vary depending on the model you choose for Vision Parse and its parameter settings.
### Run Your Own Benchmarks
You can benchmark the performance of Vision Parse on your machine using your own dataset. Run `scoring.py` to generate a detailed comparison report in the output directory.
1. Install packages from requirements.txt:
```bash
pip install --no-cache-dir -r benchmarks/requirements.txt
```
2. Run the benchmark script:
```bash
# Change `pdf_path` to your pdf file path and `benchmark_results_path` to your desired output path
python benchmarks/scoring.py
```
## 🤝 Contributing
Contributions to Vision Parse are welcome! Whether you're fixing bugs, adding new features, or creating example notebooks, your help is appreciated. Please check out [contributing guidelines](CONTRIBUTING.md) for instructions on setting up the development environment, code style requirements, and the pull request process.
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.