Name | pdf-to-md-llm |
Version | 2.2.1 |
home_page | None |
Summary | Library and CLI to convert PDF documents to clean, well-structured Markdown using LLM-assisted processing, leveraging Anthropic and OpenAI models for intelligent extraction of text, tables, and complex layouts. |
upload_time | 2025-10-07 02:11:57 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.10 |
license | MIT |
keywords | ai, anthropic, converter, llm, markdown, openai, pdf |
# PDF to Markdown Converter
Library and CLI to convert PDF documents to clean, well-structured Markdown using LLM-assisted processing, leveraging Anthropic and OpenAI models for intelligent extraction of text, tables, and complex layouts.
## Features
- **Vision Mode**: Enhanced extraction using multimodal AI for complex layouts, tables, charts, and diagrams
- **Multi-Provider Support**: Use Anthropic (Claude) or OpenAI (GPT) models
- **Smart Conversion**: Intelligently converts PDF content to clean markdown with proper formatting
- **Large File Support**: Automatically chunks large PDFs for optimal processing
- **Batch Processing**: Convert entire folders of PDFs with preserved directory structure
- **Table Preservation**: Accurately converts tables to markdown format with vision-enhanced detection
- **Structure Detection**: Automatically generates appropriate heading hierarchy
- **Dual Interface**: Use as both a CLI tool and a Python library
## Quick Start
```bash
# 1. Install with uv (recommended - faster)
uv tool install pdf-to-md-llm
# 2. Set your API key
export ANTHROPIC_API_KEY='your-api-key-here'
# 3. Convert a PDF
pdf-to-md-llm convert document.pdf --vision
```
## Installation
### Using uv (Recommended)
[uv](https://github.com/astral-sh/uv) is a fast Python package installer:
```bash
# Install the package as a tool
uv tool install pdf-to-md-llm
# Or run directly without installing
uvx pdf-to-md-llm convert document.pdf
```
### Using pip (Alternative)
```bash
pip install pdf-to-md-llm
```
## Configuration
Set your API key for at least one provider:
```bash
# For Anthropic (Claude) - recommended
export ANTHROPIC_API_KEY='your-anthropic-api-key-here'
# For OpenAI (GPT)
export OPENAI_API_KEY='your-openai-api-key-here'
```
Or create a `.env.local` file:
```bash
ANTHROPIC_API_KEY=your-anthropic-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
```
### Default Models (Optimized for Cost/Quality)
The tool uses cost-effective models by default:
- **Anthropic**: `claude-3-5-haiku-20241022` ($0.80 input / $4 output per million tokens)
- **OpenAI**: `gpt-4o-mini` ($0.15 input / $0.60 output per million tokens)
These defaults provide excellent quality for most PDF conversion tasks at significantly lower cost. For complex documents requiring maximum accuracy, you can override with premium models:
```bash
# Use more powerful Anthropic model for complex documents
pdf-to-md-llm convert complex-doc.pdf --model claude-sonnet-4-20250514 --vision
# Use OpenAI's flagship model
pdf-to-md-llm convert complex-doc.pdf --provider openai --model gpt-4o --vision
```
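As a rough back-of-envelope on those prices (the per-page token counts below are illustrative assumptions, not measurements):

```python
# Rough per-document cost estimate for the default models.
# Prices come from the list above; token counts are illustrative guesses.

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Return the API cost in dollars for one conversion."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# e.g. a 50-page PDF at roughly 500 input + 400 output tokens per page
haiku = estimate_cost(25_000, 20_000, 0.80, 4.00)   # Anthropic default
mini = estimate_cost(25_000, 20_000, 0.15, 0.60)    # OpenAI default
print(f"claude-3-5-haiku: ${haiku:.3f}, gpt-4o-mini: ${mini:.3f}")
```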
To see all available models from your configured providers, see [List Available Models](#list-available-models).
## Usage Examples
### Basic Conversion
```bash
# Simple document conversion
pdf-to-md-llm convert document.pdf
# Specify output filename
pdf-to-md-llm convert document.pdf output.md
```
### Scenario 1: Academic Papers with Tables
For research papers, technical documents, or any PDF with complex tables:
```bash
# Vision mode provides superior table extraction
pdf-to-md-llm convert research-paper.pdf --vision
```
### Scenario 2: Large Documents (500+ pages)
For textbooks, manuals, or large documents, use smaller chunks for better processing:
```bash
# Reduce chunk size for memory efficiency
pdf-to-md-llm convert textbook.pdf --vision --vision-pages-per-chunk 4
```
### Scenario 3: Documents with Charts and Diagrams
For PDFs containing visual elements like charts, graphs, or diagrams:
```bash
# Vision mode analyzes images and describes visual content
pdf-to-md-llm convert annual-report.pdf --vision --vision-dpi 200
```
### Scenario 4: Using OpenAI GPT Models
Switch to OpenAI for different model capabilities:
```bash
# Use GPT-4o for conversion
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o --vision
# Use GPT-4o-mini for cost savings
pdf-to-md-llm convert document.pdf --provider openai --model gpt-4o-mini
```
### Scenario 5: Batch Processing Multiple Documents
Convert entire folders of PDFs:
```bash
# Convert all PDFs in a folder (single-threaded)
pdf-to-md-llm batch ./research-papers
# With custom output folder and vision mode
pdf-to-md-llm batch ./input-pdfs ./output-markdown --vision
# Batch with OpenAI
pdf-to-md-llm batch ./pdfs --provider openai --vision
# Use multithreading for faster batch conversion (2 threads)
pdf-to-md-llm batch ./pdfs --threads 2 --vision
# Use 4 threads for even faster processing
pdf-to-md-llm batch ./pdfs --threads 4 --vision
# Maximum parallelization (be mindful of API rate limits)
pdf-to-md-llm batch ./large-batch --threads 8
```
**Multithreading Benefits:**
- Dramatically reduces total conversion time for large batches
- Efficiently utilizes multi-core processors
- Thread count can be adjusted based on system resources and API rate limits
- Default is single-threaded (1 thread) to avoid rate limit issues
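The pattern behind `--threads` is ordinary thread-pool fan-out. A minimal sketch (illustrative only, not the library's internal code; `convert_one` is a stand-in for a real per-file conversion call):

```python
# Fan a list of PDF paths out to worker threads, collecting results in order.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def convert_one(pdf_path: Path) -> str:
    # Placeholder for a per-file call such as convert_pdf_to_markdown(str(pdf_path))
    return pdf_path.with_suffix(".md").name

def batch_convert_sketch(paths: list[Path], threads: int = 4) -> list[str]:
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order even though the work runs in parallel
        return list(pool.map(convert_one, paths))

print(batch_convert_sketch([Path("a.pdf"), Path("b.pdf")], threads=2))  # -> ['a.md', 'b.md']
```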
### Scenario 6: Simple Text Documents
For PDFs with simple text layout (no tables or complex formatting), standard mode is faster and more cost-effective:
```bash
# Standard mode (no vision) - faster and cheaper
pdf-to-md-llm convert simple-doc.pdf
# Adjust chunk size for standard mode
pdf-to-md-llm convert simple-doc.pdf --pages-per-chunk 10
```
### Getting Help
```bash
# Check the installed version
pdf-to-md-llm --version
# Show all available options
pdf-to-md-llm --help
# Show help for specific commands
pdf-to-md-llm convert --help
pdf-to-md-llm batch --help
pdf-to-md-llm models --help
```
### List Available Models
Check which AI models are available from your configured providers:
```bash
# List all available models from all configured providers
pdf-to-md-llm models
# List models from a specific provider
pdf-to-md-llm models --provider anthropic
pdf-to-md-llm models --provider openai
```
The `models` command will:
- Show available models from providers that have API keys configured
- Display the default model for each provider
- Only query providers with valid API keys in your environment
## Using as a Python Library
First, add the package to your project:
```bash
# Using uv (recommended)
uv add pdf-to-md-llm
# Or using pip
pip install pdf-to-md-llm
```
Then import and use in your Python code:
```python
from pdf_to_md_llm import convert_pdf_to_markdown, batch_convert
# Convert with vision mode (recommended for complex layouts)
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    output_path="output.md",   # Optional
    provider="anthropic",      # 'anthropic' or 'openai'
    use_vision=True,           # Enable vision mode
    pages_per_chunk=8,         # Pages per chunk (vision default: 8)
    verbose=True               # Show progress
)

# Use OpenAI with custom model
markdown_content = convert_pdf_to_markdown(
    pdf_path="document.pdf",
    provider="openai",
    model="gpt-4o",
    use_vision=True,
    api_key="your-openai-key"  # Optional if env var set
)

# Batch convert all PDFs in a folder
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",  # Optional
    provider="anthropic",
    use_vision=True,
    verbose=True
)

# Batch convert with multithreading for faster processing
batch_convert(
    input_folder="./pdfs",
    output_folder="./markdown",
    provider="anthropic",
    use_vision=True,
    threads=4,  # Use 4 threads for parallel processing
    verbose=True
)
```
### Advanced Library Usage
```python
from pdf_to_md_llm import extract_text_from_pdf, extract_pages_with_vision, chunk_pages
# Extract text only (standard mode)
pages = extract_text_from_pdf("document.pdf")
print(f"Found {len(pages)} pages")
# Extract with vision data (text + images)
vision_pages = extract_pages_with_vision("document.pdf", dpi=150)
for page in vision_pages:
    print(f"Page {page['page_num']}: has_tables={page['has_tables']}, has_images={page['has_images']}")
# Create custom chunks
chunks = chunk_pages(pages, pages_per_chunk=5)
print(f"Created {len(chunks)} chunks")
```
## How It Works
### Standard Mode
1. **Text Extraction**: Extracts text from PDF using PyMuPDF
2. **Chunking**: Breaks content into manageable chunks (default: 5 pages per chunk)
3. **LLM Processing**: Sends each chunk to your chosen AI provider for intelligent markdown conversion
4. **Reassembly**: Combines all chunks into a single, formatted markdown document
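The chunking step can be pictured in a few lines (a simplified illustration of the idea, not the library's exact implementation):

```python
# Join consecutive page texts so each API call sees several pages at once.

def chunk_pages_sketch(pages: list[str], pages_per_chunk: int = 5) -> list[str]:
    return [
        "\n\n".join(pages[i:i + pages_per_chunk])
        for i in range(0, len(pages), pages_per_chunk)
    ]

pages = [f"Page {n} text" for n in range(1, 13)]   # 12 pages
chunks = chunk_pages_sketch(pages)
print(len(chunks))  # 12 pages / 5 per chunk -> 3 chunks
```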
### Vision Mode (Recommended)
1. **Multimodal Extraction**: Extracts both text and renders page images from PDF
2. **Smart Chunking**: Groups pages into larger chunks (default: 8 pages) for better context
3. **Visual Analysis**: AI analyzes both text and images for superior layout understanding
4. **Enhanced Accuracy**: Better detection of tables, charts, diagrams, and complex layouts
5. **Reassembly**: Combines chunks with intelligent deduplication of headers/footers
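The header/footer deduplication in step 5 can be sketched with a simple heuristic (illustrative only; the library's actual algorithm may differ): a line that repeats in every chunk is treated as a running header and kept only once.

```python
# Keep the first chunk intact; strip lines from later chunks that repeat
# in every chunk (likely running headers/footers).

def dedupe_headers(chunks: list[str]) -> str:
    if not chunks:
        return ""
    line_sets = [set(c.splitlines()) for c in chunks]
    repeated = set.intersection(*line_sets)
    cleaned = [chunks[0]]
    for chunk in chunks[1:]:
        kept = [ln for ln in chunk.splitlines() if ln not in repeated]
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)

print(dedupe_headers(["ACME Report\nIntro", "ACME Report\nResults"]))
```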
**When to use Vision Mode:**
- Documents with tables or complex layouts
- PDFs containing charts, diagrams, or visual elements
- Academic papers or technical documentation
- Any document where layout matters
## Performance Tips
### Choosing Between Standard and Vision Mode
**Use Vision Mode when:**
- PDF contains tables, charts, or diagrams
- Layout and formatting are important
- You need accurate table extraction
- Document has complex multi-column layouts
**Use Standard Mode when:**
- Simple text-only documents
- Speed and cost are priorities
- Document has straightforward single-column layout
### Chunk Size Optimization
**Larger chunks (8-10 pages):**
- Better context for the AI model
- More efficient API usage
- Better for documents with consistent formatting
- Default for vision mode
**Smaller chunks (3-5 pages):**
- Better for very large documents (500+ pages)
- Reduces memory usage
- Helpful when hitting API token limits
- Default for standard mode
### Vision Mode Settings
**DPI Settings:**
- Default (150 DPI): Good balance of quality and performance
- High quality (200-300 DPI): For small text or detailed diagrams
- Lower (100 DPI): Faster processing, suitable for simple layouts
**Adjusting chunk size in vision mode:**
```bash
# Smaller chunks for very large documents
pdf-to-md-llm convert large.pdf --vision --vision-pages-per-chunk 4
# Larger chunks for better context
pdf-to-md-llm convert doc.pdf --vision --vision-pages-per-chunk 12
```
## Troubleshooting
### API Key Errors
**Error:** `ValueError: API key not found`
**Solution:**
- Verify your API key is set in environment variables
- Check the key name matches your provider (ANTHROPIC_API_KEY or OPENAI_API_KEY)
- Ensure the key is valid and not expired
### Rate Limiting
**Error:** API rate limit exceeded
**Solution:**
- Reduce chunk size to make smaller API requests
- Add delays between batch conversions
- Upgrade your API plan for higher limits
- Switch providers if one is experiencing issues
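When scripting the library, delays can be added with a standard retry-and-backoff wrapper (a sketch, not a built-in feature; `RateLimitError` here is a stand-in for your provider SDK's exception):

```python
# Retry a callable with exponential backoff when the API rate-limits us.
import time

class RateLimitError(Exception):
    """Stand-in for the provider SDK's rate-limit exception."""

def with_backoff(call, retries: int = 4, base_delay: float = 1.0):
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Example: a call that fails twice before succeeding
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

print(with_backoff(flaky, base_delay=0.01))
```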
### Large File Issues
**Error:** Memory errors or timeouts on large PDFs
**Solution:**
- Use smaller chunk sizes: `--vision-pages-per-chunk 3`
- Process in batches by splitting the PDF first
- Use standard mode instead of vision for simple documents
- Increase available system memory
### Vision Mode Memory Issues
**Error:** Out of memory when using vision mode
**Solution:**
- Reduce DPI: `--vision-dpi 100`
- Use smaller chunks: `--vision-pages-per-chunk 4`
- Process fewer pages at once
- Close other applications to free memory
### Poor Quality Output
**Problem:** Markdown output has formatting issues
**Solution:**
- Try vision mode for better layout detection: `--vision`
- Increase DPI for better image quality: `--vision-dpi 200`
- Try different models: `--provider openai --model gpt-4o`
- Adjust chunk size for better context
## API Reference
### Main Functions
#### `convert_pdf_to_markdown()`
```python
def convert_pdf_to_markdown(
    pdf_path: str,
    output_path: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150
) -> str
```
Convert a single PDF to markdown.
**Parameters:**
- `pdf_path`: Path to the PDF file
- `output_path`: Optional output file path (defaults to PDF name with .md extension)
- `pages_per_chunk`: Number of pages per API call (default: 5 for standard, 8 for vision)
- `provider`: AI provider - 'anthropic' or 'openai' (default: 'anthropic')
- `api_key`: API key (defaults to provider-specific environment variable)
- `model`: Model to use (optional, uses provider defaults)
- `max_tokens`: Maximum tokens per API call (default: 4000)
- `verbose`: Print progress messages (default: True)
- `use_vision`: Enable vision mode for better extraction (default: False)
- `vision_dpi`: DPI for page images in vision mode (default: 150)
**Returns:** The complete markdown content as a string
**Raises:** `ValueError` if API key is missing or provider is invalid
#### `batch_convert()`
```python
def batch_convert(
    input_folder: str,
    output_folder: Optional[str] = None,
    pages_per_chunk: int = 5,
    provider: str = "anthropic",
    api_key: Optional[str] = None,
    model: Optional[str] = None,
    max_tokens: int = 4000,
    verbose: bool = True,
    use_vision: bool = False,
    vision_dpi: int = 150,
    threads: int = 1
) -> None
```
Convert all PDFs in a folder to markdown.
**Parameters:**
- `input_folder`: Folder containing PDF files
- `output_folder`: Optional output folder (defaults to input folder)
- `threads`: Number of threads for parallel processing (default: 1 for single-threaded)
- All other parameters same as `convert_pdf_to_markdown()`
**Note on Multithreading:**
- Single-threaded (`threads=1`): Default, sequential processing
- Multithreaded (`threads>1`): Parallel processing for faster batch conversion
- Be mindful of API rate limits when using higher thread counts
- Progress output is simplified in multithreaded mode for clarity
#### `extract_text_from_pdf()`
```python
def extract_text_from_pdf(pdf_path: str) -> List[str]
```
Extract raw text from PDF (standard mode).
**Returns:** List of strings, one per page
#### `extract_pages_with_vision()`
```python
def extract_pages_with_vision(pdf_path: str, dpi: int = 150) -> List[Dict[str, Any]]
```
Extract text and images from PDF pages for vision processing.
**Returns:** List of dicts with keys: `page_num`, `text`, `image_base64`, `has_images`, `has_tables`
#### `chunk_pages()`
```python
def chunk_pages(pages: List[str], pages_per_chunk: int) -> List[str]
```
Combine pages into chunks for processing.
**Returns:** List of combined page chunks
## Output Format
Converted markdown files include:
- Document title header
- Clean heading hierarchy
- Properly formatted tables
- Organized lists
- Removed page numbers and PDF artifacts
- Conversion metadata footer
## Requirements
- Python 3.10 or higher
- API key for at least one provider:
- Anthropic API key (for Claude models)
- OpenAI API key (for GPT models)
## Dependencies
All dependencies are automatically installed:
- **anthropic**: Claude API client (for Anthropic provider)
- **openai**: OpenAI API client (for OpenAI provider)
- **pymupdf**: PDF text and image extraction
- **python-dotenv**: Environment variable management
- **click**: CLI framework
## License
This project is open source and available under the MIT License.
## Contributing
Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, testing, and contribution guidelines.
For bug reports and feature requests, please open an issue on [GitHub](https://github.com/densom/pdf-to-md-llm/issues).