# PDF to Markdown Converter
A Python application that leverages Large Language Models (LLMs) to accurately convert technical PDF documents into well-structured Markdown.
## Features
- 🚀 **High-Quality Conversion**: Uses state-of-the-art LLMs for accurate text extraction
- 📊 **Table Preservation**: Converts tables to HTML or Markdown format (configurable)
- 🔢 **Equation Support**: Preserves mathematical equations in LaTeX format
- 🖼️ **Image Handling**: Describes images and preserves captions
- ⚡ **Parallel Processing**: Processes multiple pages concurrently for speed
- 📈 **Progress Tracking**: Clear logging of processing status
- 🔧 **Configurable**: Extensive configuration options via YAML or CLI
- 🔄 **Retry Logic**: Automatic retry with exponential backoff for reliability
- ✅ **Validation Pipeline**: Extensible validation system with multiple validators
- 🔍 **Repetition Detection**: Automatically detects and corrects content repetition
- ✔️ **Markdown Validation**: Built-in syntax validation and correction using PyMarkdown
- 🎯 **Pure Output**: Generates only document content without additional commentary
- 🧹 **Smart Cleaning**: Automatically removes markdown code fences that LLMs sometimes add
- 📄 **Configurable Page Separators**: Customize how pages are separated in the output
## Installation
### From PyPI (Coming Soon)
```bash
pip install pdf2markdown
```
### Using Hatch (Development)
```bash
# Install Hatch
pipx install hatch
# Clone the repository
git clone https://github.com/juanqui/pdf2markdown.git
cd pdf2markdown
# Install dependencies
hatch env create
# Activate environment
hatch shell
```
### Using pip
```bash
# Clone the repository
git clone https://github.com/juanqui/pdf2markdown.git
cd pdf2markdown
# Install the package
pip install -e .
# Optional: Install with transformers support for local models
pip install -e ".[transformers]"
```
## Quick Start
1. **Set up configuration:**
```bash
# Copy the sample configuration file
cp config/default.sample.yaml config/default.yaml
# Edit the configuration file with your settings
# At minimum, update the llm_provider section with your API details
nano config/default.yaml # or use your preferred editor
```
2. **Set your API key (recommended via environment variable):**
```bash
export OPENAI_API_KEY="your-api-key-here"
```
3. **Convert a PDF:**
```bash
pdf2markdown input.pdf -o output.md
```
## Library Usage
`pdf2markdown` can be used as a Python library in your own applications. This is useful for integrating PDF conversion into larger systems, web applications, or data processing pipelines.
### Simple Usage
```python
from pdf2markdown import PDFConverter
# Create converter with default settings
converter = PDFConverter()
# Convert a PDF to markdown
markdown_text = converter.convert_sync("document.pdf")
print(markdown_text)
# Save to a file
markdown_text = converter.convert_sync("document.pdf", "output.md")
```
### Configuration Options
```python
from pdf2markdown import PDFConverter, ConfigBuilder
# Build configuration programmatically
config = ConfigBuilder() \
    .with_openai(api_key="your-api-key", model="gpt-4o") \
    .with_resolution(400) \
    .with_page_workers(20) \
    .with_cache_dir("/tmp/my_cache") \
    .build()
converter = PDFConverter(config=config)
markdown = converter.convert_sync("document.pdf")
```
### Table Format Configuration
```python
from pdf2markdown import ConfigBuilder, PDFConverter
# Configure for HTML tables (better for complex layouts)
config = ConfigBuilder() \
    .with_openai(api_key="your-api-key") \
    .build()
# Set table format in the configuration
config['page_parser']['table_format'] = 'html' # Default
converter = PDFConverter(config=config)
# Or configure for Markdown tables (simpler format)
config['page_parser']['table_format'] = 'markdown'
```
### Using Different LLM Providers
```python
from pdf2markdown import ConfigBuilder, PDFConverter
# OpenAI (or compatible endpoints)
config = ConfigBuilder() \
    .with_openai(
        api_key="your-key",
        model="gpt-4o-mini",
        endpoint="https://api.openai.com/v1"  # or your custom endpoint
    ) \
    .build()

# Local models with Transformers
config = ConfigBuilder() \
    .with_transformers(
        model_name="microsoft/Phi-3.5-vision-instruct",
        device="cuda",  # or "cpu"
        torch_dtype="float16"
    ) \
    .build()
converter = PDFConverter(config=config)
```
### Async Usage
```python
import asyncio
from pdf2markdown import PDFConverter
async def convert_pdf():
    converter = PDFConverter()

    # Async conversion
    markdown = await converter.convert("document.pdf")

    # With progress callback
    async def progress(current, total, message):
        print(f"Progress: {current}/{total} - {message}")

    markdown = await converter.convert(
        "document.pdf",
        progress_callback=progress
    )

    return markdown

# Run the async function
markdown = asyncio.run(convert_pdf())
```
### Streaming Pages
Process large documents page by page as they complete:
```python
import asyncio
from pdf2markdown import PDFConverter
async def stream_conversion():
    converter = PDFConverter()

    async for page in converter.stream_pages("large_document.pdf"):
        print(f"Page {page.page_number}: {len(page.content)} characters")
        # Process each page as it completes
        # e.g., save to a database, send to a queue, etc.
asyncio.run(stream_conversion())
```
### Batch Processing
Convert multiple PDFs efficiently:
```python
import asyncio
from pdf2markdown import ConversionStatus, PDFConverter

async def batch_convert():
    converter = PDFConverter()

    pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    results = await converter.process_batch(
        pdf_files,
        output_dir="./output"
    )

    for result in results:
        if result.status == ConversionStatus.COMPLETED:
            print(f"✓ {result.source_path}")
        else:
            print(f"✗ {result.source_path}: {result.error_message}")
asyncio.run(batch_convert())
```
### Loading Configuration from Files
```python
from pdf2markdown import PDFConverter, Config
# From YAML file
config = Config.from_yaml("config.yaml")
converter = PDFConverter(config=config)
# From dictionary
config_dict = {
    "llm_provider": {
        "provider_type": "openai",
        "api_key": "your-key",
        "model": "gpt-4o-mini"
    },
    "pipeline": {
        "page_workers": 15
    }
}
converter = PDFConverter(config=config_dict)
```
### Error Handling
```python
from pdf2markdown import (
    PDFConverter,
    PDFConversionError,
    ConfigurationError,
    ParsingError
)
try:
    converter = PDFConverter()
    markdown = converter.convert_sync("document.pdf")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
except ParsingError as e:
    print(f"Failed to parse PDF: {e}")
    if e.page_number:
        print(f"Error on page {e.page_number}")
except PDFConversionError as e:
    print(f"Conversion failed: {e}")
```
### Context Manager
Properly clean up resources using context managers:
```python
import asyncio
from pdf2markdown import PDFConverter
async def convert_with_cleanup():
    async with PDFConverter() as converter:
        markdown = await converter.convert("document.pdf")
    # Converter automatically cleaned up after this block
    return markdown
markdown = asyncio.run(convert_with_cleanup())
```
### Integration Examples
#### Flask Web Application
```python
from flask import Flask, request, jsonify
from pdf2markdown import PDFConverter
app = Flask(__name__)
converter = PDFConverter()
@app.route('/convert', methods=['POST'])
def convert_pdf():
    if 'file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400

    file = request.files['file']
    file.save('/tmp/upload.pdf')

    try:
        markdown = converter.convert_sync('/tmp/upload.pdf')
        return jsonify({'markdown': markdown})
    except Exception as e:
        return jsonify({'error': str(e)}), 500
```
#### Celery Task Queue
```python
from celery import Celery
from pdf2markdown import PDFConverter
app = Celery('tasks', broker='redis://localhost:6379')
converter = PDFConverter()
@app.task
def convert_pdf_task(pdf_path):
    """Background task to convert a PDF."""
    return converter.convert_sync(pdf_path)
```
#### Document Processing Pipeline
```python
from pdf2markdown import PDFConverter, ConfigBuilder
import sqlite3
# Configure for high-quality conversion
config = ConfigBuilder() \
    .with_openai(api_key="your-key", model="gpt-4o") \
    .with_resolution(400) \
    .with_validators(['markdown', 'repetition']) \
    .build()

converter = PDFConverter(config=config)

def process_document(pdf_path, doc_id):
    """Process a document and store it in the database."""
    # Convert PDF
    markdown = converter.convert_sync(pdf_path)

    # Store in database
    conn = sqlite3.connect('documents.db')
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO documents (id, content) VALUES (?, ?)",
        (doc_id, markdown)
    )
    conn.commit()
    conn.close()

    return doc_id
```
## CLI Usage
### Basic Usage
```bash
# Convert with default settings
pdf2markdown document.pdf
# Specify output file
pdf2markdown document.pdf -o converted.md
# Use a specific model
pdf2markdown document.pdf --model gpt-4o
# Adjust rendering resolution
pdf2markdown document.pdf --resolution 400
```
### Advanced Usage
```bash
# Use custom configuration file
pdf2markdown document.pdf --config my-config.yaml
# Parallel processing with more workers
pdf2markdown document.pdf --page-workers 20
# Disable progress logging for automation
pdf2markdown document.pdf --no-progress
# Save configuration for reuse
pdf2markdown document.pdf --save-config my-settings.yaml
# Specify table format (html or markdown)
pdf2markdown document.pdf --table-format html # For complex tables
pdf2markdown document.pdf --table-format markdown # For simple tables
```
### Configuration
#### Initial Setup
The application uses a YAML configuration file to manage settings. To get started:
1. **Copy the sample configuration:**
```bash
cp config/default.sample.yaml config/default.yaml
```
2. **Review and edit the configuration:**
The sample file (`config/default.sample.yaml`) is heavily documented with explanations for every setting. Key sections to configure:
- `llm_provider`: Your LLM API settings (endpoint, API key, model)
- `document_parser`: PDF rendering settings
- `pipeline`: Worker and processing settings
3. **Set sensitive values via environment variables:**
Instead of hardcoding API keys in the config file, use environment variables:
```bash
export OPENAI_API_KEY="your-api-key-here"
```
Then reference it in your config as: `${OPENAI_API_KEY}`
#### Configuration File Structure
Here's an overview of the configuration structure:
```yaml
# LLM Provider Configuration (required)
llm_provider:
  provider_type: openai                 # Provider type ("openai" or "transformers")
  endpoint: https://api.openai.com/v1   # API endpoint URL
  api_key: ${OPENAI_API_KEY}            # Can reference environment variables
  model: gpt-4o-mini                    # Model to use
  max_tokens: 4096                      # Maximum tokens in response
  temperature: 0.1                      # Generation temperature (0.0-2.0)
  timeout: 60                           # Request timeout in seconds

  # Penalty parameters to reduce repetition (all optional)
  presence_penalty: 0.0                 # Penalize tokens based on presence (-2.0 to 2.0)
  frequency_penalty: 0.0                # Penalize tokens based on frequency (-2.0 to 2.0)
  repetition_penalty: null              # Alternative repetition penalty (0.0 to 2.0, some providers only)

# Document Parser Configuration
document_parser:
  type: simple                          # Parser type
  resolution: 300                       # DPI for rendering PDF pages to images
  cache_dir: /tmp/pdf2markdown/cache    # Cache directory for rendered images
  max_page_size: 50000000               # Maximum page size in bytes (50MB)
  timeout: 30                           # Timeout for rendering operations

# Page Parser Configuration
page_parser:
  type: simple_llm                      # Parser type
  prompt_template: null                 # Optional custom prompt template path
  additional_instructions: null         # Optional additional LLM instructions

  # Table format configuration
  table_format: html                    # 'html' for complex layouts, 'markdown' for simple tables

  # Content validation pipeline configuration
  validate_content: true                # Enable content validation
  validation:
    # List of validators to run (in order)
    validators: ["markdown", "repetition"]
    # Maximum number of correction attempts
    max_correction_attempts: 2

    # Markdown validator - checks syntax and formatting
    markdown:
      enabled: true                     # Enable this validator
      attempt_correction: true          # Try to fix issues by re-prompting the LLM
      strict_mode: false                # Use relaxed mode for LLM-generated content
      max_line_length: 1000             # Max line length (MD013 rule)
      disabled_rules: []                # Additional rules to disable
      enabled_rules: []                 # Specific rules to enable
      # Note: Common overly-strict rules are disabled by default, including:
      # MD033 (Inline HTML) - common in technical documents and tables
      # MD026 (Trailing punctuation in headings) - common in PDF headings
      # MD042 (No empty links) - LLMs may generate placeholder links during extraction
      # MD041, MD022, MD031, MD032, MD025, MD024, MD013, MD047, MD040

    # Repetition validator - detects and corrects unwanted repetition
    repetition:
      enabled: true                     # Enable this validator
      attempt_correction: true          # Try to fix repetition issues
      consecutive_threshold: 3          # Flag 3+ consecutive duplicate lines
      window_size: 10                   # Check within 10-line windows
      window_threshold: 3               # Flag 3+ occurrences in a window
      check_exact_lines: true           # Check for exact duplicates
      check_normalized_lines: true      # Check ignoring whitespace/punctuation
      check_paragraphs: true            # Check for duplicate paragraphs
      check_patterns: true              # Detect repetitive patterns
      min_pattern_length: 20            # Minimum chars for pattern detection
      pattern_similarity_threshold: 0.9 # Similarity threshold (0-1)
      min_line_length: 5                # Minimum line length to check

# Pipeline Configuration
pipeline:
  document_workers: 1                   # Must be 1 for sequential document processing
  page_workers: 10                      # Number of parallel page-processing workers
  queues:
    document_queue_size: 100
    page_queue_size: 1000
    output_queue_size: 500
  enable_progress: true                 # Show progress bars
  log_level: INFO                       # Logging level

# Output Configuration
output_dir: ./output                    # Default output directory
temp_dir: /tmp/pdf2markdown             # Temporary file directory
page_separator: "\n\n--[PAGE: {page_number}]--\n\n"  # Separator between pages
```
#### Configuration Hierarchy
Configuration values are loaded in the following order (later values override earlier ones):
1. Default values in code
2. Configuration file (`config/default.yaml` or file specified via `--config`)
3. Environment variables
4. Command-line arguments
**Note:** The application looks for `config/default.yaml` in the current working directory by default. You can specify a different configuration file using the `--config` option:
```bash
pdf2markdown input.pdf --config /path/to/my-config.yaml
```
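This layered override behaves like a recursive dictionary merge in which later layers win. The sketch below illustrates the idea only; `deep_merge` is a hypothetical helper, not part of the `pdf2markdown` API:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; values in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Later layers override earlier ones: defaults < config file < env vars < CLI
defaults = {"llm_provider": {"model": "gpt-4o-mini", "timeout": 60}}
config_file = {"llm_provider": {"model": "gpt-4o"}}
cli_args = {"llm_provider": {"timeout": 120}}

effective = deep_merge(deep_merge(defaults, config_file), cli_args)
print(effective)  # {'llm_provider': {'model': 'gpt-4o', 'timeout': 120}}
```

Note that the merge is per-key, so a config file that sets only `model` does not clobber the default `timeout`.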
#### LLM Provider Configuration
The `llm_provider` section is shared across all components that need LLM access. This centralized configuration makes it easy to:
- Switch between different LLM providers
- Use the same provider settings for multiple components
- Override settings globally via environment variables or CLI
**Supported Providers:**
- `openai`: Any OpenAI-compatible API (OpenAI, Azure OpenAI, local servers with OpenAI-compatible endpoints)
- `transformers`: Local models using HuggingFace Transformers (requires optional dependencies)
**Future Providers (planned):**
- `ollama`: Local models via Ollama
- `anthropic`: Anthropic Claude API
- `google`: Google Gemini API
##### Penalty Parameters for Reducing Repetition
To avoid repetitive text in the generated markdown, you can configure penalty parameters:
- **presence_penalty** (-2.0 to 2.0): Penalizes tokens that have already appeared in the text. Positive values discourage repetition.
- **frequency_penalty** (-2.0 to 2.0): Penalizes tokens based on their frequency in the text so far. Positive values reduce repetition of common phrases.
- **repetition_penalty** (0.0 to 2.0): Alternative parameter used by some providers (e.g., local models). Values > 1.0 reduce repetition.
**Recommended settings for reducing repetition:**
```yaml
llm_provider:
  presence_penalty: 0.5
  frequency_penalty: 0.5
  # OR for providers that use repetition_penalty:
  repetition_penalty: 1.15
```
#### Custom OpenAI-Compatible Endpoints
To use a custom OpenAI-compatible endpoint (e.g., local LLM server, vLLM, etc.):
```yaml
llm_provider:
  provider_type: openai
  endpoint: http://localhost:8080/v1  # Your custom endpoint
  api_key: dummy-key                  # Some endpoints require a placeholder
  model: your-model-name
  max_tokens: 8192
  temperature: 0.7
  timeout: 120
```
#### Using Local Models with Transformers
The Transformers provider allows you to run models locally using HuggingFace Transformers. This is useful for:
- Running without API costs
- Processing sensitive documents locally
- Using specialized models not available via APIs
- Running on systems with GPU acceleration
**Installation:**
```bash
# Install with transformers support
pip install -e ".[transformers]"
```
**Configuration Example:**
```yaml
llm_provider:
  provider_type: transformers
  model_name: "openbmb/MiniCPM-V-4"  # HuggingFace model ID
  device: "auto"                     # or "cuda", "cpu", "cuda:0", etc.
  torch_dtype: "bfloat16"            # or "float16", "float32", "auto"
  max_tokens: 4096
  temperature: 0.1
  do_sample: false
  # Optional: Use 4-bit quantization to save memory
  load_in_4bit: true
  # Optional: For models with a .chat() method
  use_chat_method: true
```
**Supported Models (examples):**
- **MiniCPM-V series**: `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-2_6`
- **Nanonets OCR**: `nanonets/Nanonets-OCR-s`
- **Other vision models**: Any model supporting image-text-to-text generation
**Performance Tips:**
- Use `load_in_4bit: true` or `load_in_8bit: true` to reduce memory usage
- Set `page_workers: 1` in pipeline config for local models (they use more memory)
- Use `device_map: "auto"` for multi-GPU systems
- Consider using `attn_implementation: "flash_attention_2"` for faster inference (if supported)
See `config/transformers_example.yaml` for a complete configuration example.
## Environment Variables
### LLM Provider Variables
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `OPENAI_API_ENDPOINT`: Custom API endpoint URL (optional)
- `OPENAI_MODEL`: Model to use (default: gpt-4o-mini)
### Application Variables
- `PDF2MARKDOWN_CACHE_DIR`: Cache directory for rendered images
- `PDF2MARKDOWN_OUTPUT_DIR`: Default output directory
- `PDF2MARKDOWN_LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR)
- `PDF2MARKDOWN_TEMP_DIR`: Temporary file directory
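For example, you might export the relevant variables before running the CLI (all values below are placeholders):

```shell
# Provider settings
export OPENAI_API_KEY="sk-your-key"
export OPENAI_MODEL="gpt-4o-mini"

# Application settings
export PDF2MARKDOWN_CACHE_DIR="$HOME/.cache/pdf2markdown"
export PDF2MARKDOWN_LOG_LEVEL="DEBUG"
```

Running `pdf2markdown document.pdf` in the same shell session then picks these values up without editing the config file.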
## How It Works
1. **Document Parsing**: PDF pages are rendered as high-resolution images using PyMuPDF
2. **LLM Provider**: The configured LLM provider handles communication with the AI model
3. **Image Processing**: Each page image is sent to the LLM with vision capabilities
4. **Content Extraction**: The LLM extracts and formats content as Markdown
5. **Validation Pipeline**: Content passes through multiple validators:
- **Markdown Validator**: Checks syntax and formatting
- **Repetition Validator**: Detects unwanted repetition patterns
6. **Correction** (optional): If issues are found, the LLM is re-prompted with specific instructions to fix them
7. **Assembly**: Processed pages are combined into a single Markdown document
### Architecture Overview
The application uses a modular architecture with these key components:
- **LLM Provider**: Abstraction layer for different LLM services (OpenAI, local models, etc.)
- **Document Parser**: Converts PDF pages to images
- **Page Parser**: Converts images to Markdown using LLM
- **Validation Pipeline**: Extensible system with multiple validators:
- **Markdown Validator**: Validates and corrects syntax issues
- **Repetition Validator**: Detects and corrects unwanted repetition
- Easily extensible for additional validators
- **Pipeline**: Orchestrates the conversion process with parallel workers
- **Queue System**: Manages work distribution across workers
## Output Format
The converter preserves:
- **Headers**: Converted to appropriate Markdown heading levels
- **Tables**: Rendered as HTML tables or Markdown tables (configurable)
- **Lists**: Both ordered and unordered lists
- **Equations**: LaTeX format for mathematical expressions ($inline$ and $$display$$)
- **Images**: Descriptions or captions preserved
- **Formatting**: Bold, italic, code, and other text styling
- **Technical Elements**: Pin diagrams, electrical characteristics, timing specifications
- **Special Notations**: Notes, warnings, footnotes, and cross-references
### Table Format Options
The converter supports two table formats, configurable via the `table_format` setting:
#### HTML Tables (Default)
HTML tables are recommended for complex layouts with:
- Merged cells (colspan/rowspan)
- Nested tables
- Complex alignments
- Multi-line cell content
Example configuration:
```yaml
page_parser:
  table_format: html  # Default setting
```
Output example:
```html
<table>
  <thead>
    <tr>
      <th rowspan="2">Parameter</th>
      <th colspan="3">Conditions</th>
      <th>Unit</th>
    </tr>
    <tr>
      <th>Min</th>
      <th>Typ</th>
      <th>Max</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Operating Voltage</td>
      <td>1.7</td>
      <td>3.3</td>
      <td>3.6</td>
      <td>V</td>
    </tr>
  </tbody>
</table>
```
#### Markdown Tables
Markdown tables are simpler and more readable in plain text, best for:
- Simple tabular data
- Tables without merged cells
- Basic alignment needs
Example configuration:
```yaml
page_parser:
  table_format: markdown
```
Output example:
```markdown
| Parameter | Min | Typ | Max | Unit |
|-----------|----:|----:|----:|------|
| Voltage | 1.7 | 3.3 | 3.6 | V |
| Current | 0.1 | 0.5 | 1.0 | mA |
```
### Output Quality
The converter ensures high-quality output through multiple mechanisms:
#### Output Purity
- Outputs **ONLY** the content from the PDF document
- No explanatory text or comments
- No "Here is the content" preambles
- No additional formatting suggestions
- Automatically removes markdown code fences if LLM wraps output
- Just clean, accurate Markdown representing the original document
#### Validation Pipeline
- **Syntax Validation**: Ensures proper markdown formatting
- **Repetition Detection**: Identifies and corrects various types of repetition:
- Consecutive duplicate lines
- Near-duplicates within sliding windows
- Duplicate paragraphs
- Repetitive patterns
- **Extensible System**: Easy to add custom validators for specific needs
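As a rough illustration of the consecutive-duplicate check, the sketch below mimics the documented `consecutive_threshold` and `min_line_length` settings; it is a conceptual example, not the library's actual validator code:

```python
def find_consecutive_duplicates(text: str, threshold: int = 3,
                                min_line_length: int = 5):
    """Return (line_number, line, run_length) for each run of identical
    lines whose length is at or above `threshold`."""
    issues = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        run = 1
        while i + run < len(lines) and lines[i + run] == lines[i]:
            run += 1
        # Short lines (e.g. table rules) are ignored, like min_line_length
        if run >= threshold and len(lines[i].strip()) >= min_line_length:
            issues.append((i + 1, lines[i], run))
        i += run
    return issues

sample = "Heading\n" + "Repeated row\n" * 4 + "Normal text\n"
print(find_consecutive_duplicates(sample))  # [(2, 'Repeated row', 4)]
```

The real validator adds windowed, normalized, paragraph-level, and pattern-based checks on top of this basic idea.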
### Page Separation
Pages are separated using a configurable separator (default: `--[PAGE: N]--`). You can customize this in the configuration:
```yaml
# Examples of page separators:
page_separator: "\n---\n" # Simple horizontal rule
page_separator: "\n\n<!-- Page {page_number} -->\n\n" # HTML comment (invisible)
page_separator: "\n\n# Page {page_number}\n\n" # Markdown heading
page_separator: "\n\n--[PAGE: {page_number}]--\n\n" # Default format
```
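The `{page_number}` placeholder is filled in per page, like Python's `str.format`. A quick sketch of how an assembled document looks (illustrative only; the exact assembly is internal to the library):

```python
separator = "\n\n--[PAGE: {page_number}]--\n\n"

pages = ["First page content", "Second page content"]
# Prefix each page's markdown with its formatted separator
document = "".join(
    separator.format(page_number=number) + content
    for number, content in enumerate(pages, start=1)
)

print("--[PAGE: 2]--" in document)  # True
```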
## Performance
- Processes pages in parallel (default: 10 workers)
- Automatic caching of rendered images
- Typical processing: 5-10 seconds per page
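The parallel page workers follow the standard asyncio bounded-concurrency pattern. This is a generic sketch of that pattern with a stand-in coroutine, not the library's internals:

```python
import asyncio

async def process_page(page_number: int) -> str:
    await asyncio.sleep(0.01)  # stand-in for a per-page LLM call
    return f"page {page_number} done"

async def run_pool(num_pages: int, workers: int = 10) -> list[str]:
    semaphore = asyncio.Semaphore(workers)  # cap concurrent requests

    async def bounded(n: int) -> str:
        async with semaphore:
            return await process_page(n)

    # gather preserves input order even though pages finish concurrently
    return await asyncio.gather(*(bounded(n) for n in range(1, num_pages + 1)))

results = asyncio.run(run_pool(25))
print(len(results))  # 25
```

With 10 workers and per-page latencies of a few seconds, throughput scales roughly with the worker count until you hit API rate limits.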
## Requirements
- Python 3.10+
- OpenAI API key (or compatible endpoint)
- System dependencies for PyMuPDF
## Configuration Examples
### Using Azure OpenAI
```yaml
llm_provider:
  provider_type: openai
  endpoint: https://your-resource.openai.azure.com/
  api_key: ${AZURE_OPENAI_KEY}
  model: gpt-4-vision
  max_tokens: 4096
```
### Using Local LLM Server
```yaml
llm_provider:
  provider_type: openai
  endpoint: http://localhost:11434/v1  # Ollama with OpenAI compatibility
  api_key: not-needed
  model: llava:13b
  max_tokens: 8192
  timeout: 300  # Longer timeout for local models
  # Many local servers use repetition_penalty instead
  repetition_penalty: 1.15
```
### High-Performance Configuration
```yaml
llm_provider:
  provider_type: openai
  endpoint: https://api.openai.com/v1
  api_key: ${OPENAI_API_KEY}
  model: gpt-4o
  max_tokens: 8192
  temperature: 0.1
  # Reduce repetition for better quality output
  presence_penalty: 0.5
  frequency_penalty: 0.5

pipeline:
  page_workers: 20  # More parallel workers for faster processing

document_parser:
  resolution: 400  # Higher-quality images
```
## Troubleshooting
### API Key Issues
```bash
# Verify API key is set
echo $OPENAI_API_KEY
# Set in .env file
echo "OPENAI_API_KEY=your-key" > .env
# Check configuration
pdf2markdown document.pdf --save-config debug-config.yaml
# Then inspect debug-config.yaml
```
### Memory Issues
```bash
# Reduce worker count
pdf2markdown large.pdf --page-workers 5
# Lower resolution
pdf2markdown large.pdf --resolution 200
```
### Debugging
```bash
# Enable debug logging
pdf2markdown document.pdf --log-level DEBUG
# Check cache directory
ls /tmp/pdf2markdown/cache/
```
## Development
### Running Tests
```bash
hatch run test
```
### Code Formatting
```bash
hatch run format
```
### Type Checking
```bash
hatch run typecheck
```
## License
MIT License - see LICENSE file for details
some providers only)\n\n# Document Parser Configuration\ndocument_parser:\n type: simple # Parser type\n resolution: 300 # DPI for rendering PDF pages to images\n cache_dir: /tmp/pdf2markdown/cache # Cache directory for rendered images\n max_page_size: 50000000 # Maximum page size in bytes (50MB)\n timeout: 30 # Timeout for rendering operations\n\n# Page Parser Configuration\npage_parser:\n type: simple_llm # Parser type\n prompt_template: null # Optional custom prompt template path\n additional_instructions: null # Optional additional LLM instructions\n \n # Table format configuration\n table_format: html # 'html' for complex layouts, 'markdown' for simple tables\n \n # Content validation pipeline configuration\n validate_content: true # Enable content validation\n \n validation:\n # List of validators to run (in order)\n validators: [\"markdown\", \"repetition\"]\n \n # Maximum number of correction attempts\n max_correction_attempts: 2\n \n # Markdown validator - checks syntax and formatting\n markdown:\n enabled: true # Enable this validator\n attempt_correction: true # Try to fix issues by re-prompting LLM\n strict_mode: false # Use relaxed mode for LLM-generated content\n max_line_length: 1000 # Max line length (MD013 rule)\n disabled_rules: [] # Additional rules to disable\n enabled_rules: [] # Specific rules to enable\n # Note: Common overly-strict rules are disabled by default including:\n # MD033 (Inline HTML) - common in technical documents and tables\n # MD026 (Trailing punctuation in headings) - common in PDF headings\n # MD042 (No empty links) - LLMs may generate placeholder links during extraction\n # MD041, MD022, MD031, MD032, MD025, MD024, MD013, MD047, MD040\n \n # Repetition validator - detects and corrects unwanted repetition\n repetition:\n enabled: true # Enable this validator\n attempt_correction: true # Try to fix repetition issues\n consecutive_threshold: 3 # Flag 3+ consecutive duplicate lines\n window_size: 10 # Check within 10-line 
windows\n window_threshold: 3 # Flag 3+ occurrences in window\n check_exact_lines: true # Check for exact duplicates\n check_normalized_lines: true # Check ignoring whitespace/punctuation\n check_paragraphs: true # Check for duplicate paragraphs\n check_patterns: true # Detect repetitive patterns\n min_pattern_length: 20 # Minimum chars for pattern detection\n pattern_similarity_threshold: 0.9 # Similarity threshold (0-1)\n min_line_length: 5 # Minimum line length to check\n\n# Pipeline Configuration\npipeline:\n document_workers: 1 # Must be 1 for sequential document processing\n page_workers: 10 # Number of parallel page processing workers\n queues:\n document_queue_size: 100\n page_queue_size: 1000\n output_queue_size: 500\n enable_progress: true # Show progress bars\n log_level: INFO # Logging level\n\n# Output Configuration\noutput_dir: ./output # Default output directory\ntemp_dir: /tmp/pdf2markdown # Temporary file directory\npage_separator: \"\\n\\n--[PAGE: {page_number}]--\\n\\n\" # Separator between pages\n```\n\n#### Configuration Hierarchy\n\nConfiguration values are loaded in the following order (later values override earlier ones):\n\n1. Default values in code\n2. Configuration file (`config/default.yaml` or file specified via `--config`)\n3. Environment variables\n4. Command-line arguments\n\n**Note:** The application looks for `config/default.yaml` in the current working directory by default. You can specify a different configuration file using the `--config` option:\n```bash\npdf2markdown input.pdf --config /path/to/my-config.yaml\n```\n\n#### LLM Provider Configuration\n\nThe `llm_provider` section is shared across all components that need LLM access. 
This centralized configuration makes it easy to:\n\n- Switch between different LLM providers\n- Use the same provider settings for multiple components\n- Override settings globally via environment variables or CLI\n\n**Supported Providers:**\n- `openai`: Any OpenAI-compatible API (OpenAI, Azure OpenAI, local servers with OpenAI-compatible endpoints)\n- `transformers`: Local models using HuggingFace Transformers (requires optional dependencies)\n\n**Future Providers (planned):**\n- `ollama`: Local models via Ollama\n- `anthropic`: Anthropic Claude API\n- `google`: Google Gemini API\n\n##### Penalty Parameters for Reducing Repetition\n\nTo avoid repetitive text in the generated markdown, you can configure penalty parameters:\n\n- **presence_penalty** (-2.0 to 2.0): Penalizes tokens that have already appeared in the text. Positive values discourage repetition.\n- **frequency_penalty** (-2.0 to 2.0): Penalizes tokens based on their frequency in the text so far. Positive values reduce repetition of common phrases.\n- **repetition_penalty** (0.0 to 2.0): Alternative parameter used by some providers (e.g., local models). Values > 1.0 reduce repetition.\n\n**Recommended settings for reducing repetition:**\n```yaml\nllm_provider:\n presence_penalty: 0.5\n frequency_penalty: 0.5\n # OR for providers that use repetition_penalty:\n repetition_penalty: 1.15\n```\n\n#### Custom OpenAI-Compatible Endpoints\n\nTo use a custom OpenAI-compatible endpoint (e.g., local LLM server, vLLM, etc.):\n\n```yaml\nllm_provider:\n provider_type: openai\n endpoint: http://localhost:8080/v1 # Your custom endpoint\n api_key: dummy-key # Some endpoints require a placeholder\n model: your-model-name\n max_tokens: 8192\n temperature: 0.7\n timeout: 120\n```\n\n#### Using Local Models with Transformers\n\nThe Transformers provider allows you to run models locally using HuggingFace Transformers. 
This is useful for:\n- Running without API costs\n- Processing sensitive documents locally\n- Using specialized models not available via APIs\n- Running on systems with GPU acceleration\n\n**Installation:**\n```bash\n# Install with transformers support\npip install -e \".[transformers]\"\n```\n\n**Configuration Example:**\n```yaml\nllm_provider:\n provider_type: transformers\n model_name: \"openbmb/MiniCPM-V-4\" # HuggingFace model ID\n device: \"auto\" # or \"cuda\", \"cpu\", \"cuda:0\", etc.\n torch_dtype: \"bfloat16\" # or \"float16\", \"float32\", \"auto\"\n max_tokens: 4096\n temperature: 0.1\n do_sample: false\n \n # Optional: Use 4-bit quantization to save memory\n load_in_4bit: true\n \n # Optional: For models with .chat() method\n use_chat_method: true\n```\n\n**Supported Models (examples):**\n- **MiniCPM-V series**: `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-2_6`\n- **Nanonets OCR**: `nanonets/Nanonets-OCR-s`\n- **Other vision models**: Any model supporting image-text-to-text generation\n\n**Performance Tips:**\n- Use `load_in_4bit: true` or `load_in_8bit: true` to reduce memory usage\n- Set `page_workers: 1` in pipeline config for local models (they use more memory)\n- Use `device_map: \"auto\"` for multi-GPU systems\n- Consider using `attn_implementation: \"flash_attention_2\"` for faster inference (if supported)\n\nSee `config/transformers_example.yaml` for a complete configuration example.\n\n## Environment Variables\n\n### LLM Provider Variables\n- `OPENAI_API_KEY`: Your OpenAI API key (required)\n- `OPENAI_API_ENDPOINT`: Custom API endpoint URL (optional)\n- `OPENAI_MODEL`: Model to use (default: gpt-4o-mini)\n\n### Application Variables\n- `PDF2MARKDOWN_CACHE_DIR`: Cache directory for rendered images\n- `PDF2MARKDOWN_OUTPUT_DIR`: Default output directory\n- `PDF2MARKDOWN_LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR)\n- `PDF2MARKDOWN_TEMP_DIR`: Temporary file directory\n\n## How It Works\n\n1. 
**Document Parsing**: PDF pages are rendered as high-resolution images using PyMuPDF\n2. **LLM Provider**: The configured LLM provider handles communication with the AI model\n3. **Image Processing**: Each page image is sent to the LLM with vision capabilities\n4. **Content Extraction**: The LLM extracts and formats content as Markdown\n5. **Validation Pipeline**: Content passes through multiple validators:\n - **Markdown Validator**: Checks syntax and formatting\n - **Repetition Validator**: Detects unwanted repetition patterns\n6. **Correction** (optional): If issues are found, the LLM is re-prompted with specific instructions to fix them\n7. **Assembly**: Processed pages are combined into a single Markdown document\n\n### Architecture Overview\n\nThe application uses a modular architecture with these key components:\n\n- **LLM Provider**: Abstraction layer for different LLM services (OpenAI, local models, etc.)\n- **Document Parser**: Converts PDF pages to images\n- **Page Parser**: Converts images to Markdown using LLM\n- **Validation Pipeline**: Extensible system with multiple validators:\n - **Markdown Validator**: Validates and corrects syntax issues\n - **Repetition Validator**: Detects and corrects unwanted repetition\n - Easily extensible for additional validators\n- **Pipeline**: Orchestrates the conversion process with parallel workers\n- **Queue System**: Manages work distribution across workers\n\n## Output Format\n\nThe converter preserves:\n- **Headers**: Converted to appropriate Markdown heading levels\n- **Tables**: Rendered as HTML tables or Markdown tables (configurable)\n- **Lists**: Both ordered and unordered lists\n- **Equations**: LaTeX format for mathematical expressions ($inline$ and $$display$$)\n- **Images**: Descriptions or captions preserved\n- **Formatting**: Bold, italic, code, and other text styling\n- **Technical Elements**: Pin diagrams, electrical characteristics, timing specifications\n- **Special Notations**: Notes, warnings, 
footnotes, and cross-references\n\n### Table Format Options\n\nThe converter supports two table formats, configurable via the `table_format` setting:\n\n#### HTML Tables (Default)\nHTML tables are recommended for complex layouts with:\n- Merged cells (colspan/rowspan)\n- Nested tables\n- Complex alignments\n- Multi-line cell content\n\nExample configuration:\n```yaml\npage_parser:\n table_format: html # Default setting\n```\n\nOutput example:\n```html\n<table>\n <thead>\n <tr>\n <th rowspan=\"2\">Parameter</th>\n <th colspan=\"3\">Conditions</th>\n <th>Unit</th>\n </tr>\n <tr>\n <th>Min</th>\n <th>Typ</th>\n <th>Max</th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Operating Voltage</td>\n <td>1.7</td>\n <td>3.3</td>\n <td>3.6</td>\n <td>V</td>\n </tr>\n </tbody>\n</table>\n```\n\n#### Markdown Tables\nMarkdown tables are simpler and more readable in plain text, best for:\n- Simple tabular data\n- Tables without merged cells\n- Basic alignment needs\n\nExample configuration:\n```yaml\npage_parser:\n table_format: markdown\n```\n\nOutput example:\n```markdown\n| Parameter | Min | Typ | Max | Unit |\n|-----------|----:|----:|----:|------|\n| Voltage | 1.7 | 3.3 | 3.6 | V |\n| Current | 0.1 | 0.5 | 1.0 | mA |\n```\n\n### Output Quality\n\nThe converter ensures high-quality output through multiple mechanisms:\n\n#### Output Purity\n- Outputs **ONLY** the content from the PDF document\n- No explanatory text or comments\n- No \"Here is the content\" preambles\n- No additional formatting suggestions\n- Automatically removes markdown code fences if LLM wraps output\n- Just clean, accurate Markdown representing the original document\n\n#### Validation Pipeline\n- **Syntax Validation**: Ensures proper markdown formatting\n- **Repetition Detection**: Identifies and corrects various types of repetition:\n - Consecutive duplicate lines\n - Near-duplicates within sliding windows\n - Duplicate paragraphs\n - Repetitive patterns\n- **Extensible System**: Easy to add 
custom validators for specific needs\n\n### Page Separation\n\nPages are separated using a configurable separator (default: `--[PAGE: N]--`). You can customize this in the configuration:\n```yaml\n# Examples of page separators:\npage_separator: \"\\n---\\n\" # Simple horizontal rule\npage_separator: \"\\n\\n<!-- Page {page_number} -->\\n\\n\" # HTML comment (invisible)\npage_separator: \"\\n\\n# Page {page_number}\\n\\n\" # Markdown heading\npage_separator: \"\\n\\n--[PAGE: {page_number}]--\\n\\n\" # Default format\n```\n\n## Performance\n\n- Processes pages in parallel (default: 10 workers)\n- Automatic caching of rendered images\n- Typical processing: 5-10 seconds per page\n\n## Requirements\n\n- Python 3.10+\n- OpenAI API key (or compatible endpoint)\n- System dependencies for PyMuPDF\n\n## Configuration Examples\n\n### Using Azure OpenAI\n\n```yaml\nllm_provider:\n provider_type: openai\n endpoint: https://your-resource.openai.azure.com/\n api_key: ${AZURE_OPENAI_KEY}\n model: gpt-4-vision\n max_tokens: 4096\n```\n\n### Using Local LLM Server\n\n```yaml\nllm_provider:\n provider_type: openai\n endpoint: http://localhost:11434/v1 # Ollama with OpenAI compatibility\n api_key: not-needed\n model: llava:13b\n max_tokens: 8192\n timeout: 300 # Longer timeout for local models\n # Many local servers use repetition_penalty instead\n repetition_penalty: 1.15\n```\n\n### High-Performance Configuration\n\n```yaml\nllm_provider:\n provider_type: openai\n endpoint: https://api.openai.com/v1\n api_key: ${OPENAI_API_KEY}\n model: gpt-4o\n max_tokens: 8192\n temperature: 0.1\n # Reduce repetition for better quality output\n presence_penalty: 0.5\n frequency_penalty: 0.5\n\npipeline:\n page_workers: 20 # More parallel workers for faster processing\n\ndocument_parser:\n resolution: 400 # Higher quality images\n```\n\n## Troubleshooting\n\n### API Key Issues\n```bash\n# Verify API key is set\necho $OPENAI_API_KEY\n\n# Set in .env file\necho \"OPENAI_API_KEY=your-key\" > .env\n\n# 
Check configuration\npdf2markdown document.pdf --save-config debug-config.yaml\n# Then inspect debug-config.yaml\n```\n\n### Memory Issues\n```bash\n# Reduce worker count\npdf2markdown large.pdf --page-workers 5\n\n# Lower resolution\npdf2markdown large.pdf --resolution 200\n```\n\n### Debugging\n```bash\n# Enable debug logging\npdf2markdown document.pdf --log-level DEBUG\n\n# Check cache directory\nls /tmp/pdf2markdown/cache/\n```\n\n## Development\n\n### Running Tests\n```bash\nhatch run test\n```\n\n### Code Formatting\n```bash\nhatch run format\n```\n\n### Type Checking\n```bash\nhatch run typecheck\n```\n\n## License\n\nMIT License - see LICENSE file for details\n",
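## Appendix: Illustrative Sketches

The repetition validator's consecutive-duplicate check can be illustrated with a small standalone snippet. This is **not** pdf2markdown's implementation — the function name and logic here are hypothetical — but the parameters mirror the `consecutive_threshold` and `min_line_length` settings described above:

```python
def find_consecutive_repeats(text, consecutive_threshold=3, min_line_length=5):
    """Return (start_index, line, run_length) for each run of identical
    lines at least `consecutive_threshold` lines long. Lines shorter than
    `min_line_length` (after stripping) are ignored, as in the config above.
    Illustrative sketch only, not the library's validator."""
    lines = text.splitlines()
    flagged = []
    run_start, run_len = 0, 1
    for i in range(1, len(lines) + 1):
        same = (
            i < len(lines)
            and lines[i] == lines[i - 1]
            and len(lines[i].strip()) >= min_line_length
        )
        if same:
            run_len += 1
        else:
            if run_len >= consecutive_threshold:
                flagged.append((run_start, lines[run_start], run_len))
            run_start, run_len = i, 1
    return flagged

sample = "Intro line\nRepeated row\nRepeated row\nRepeated row\nOutro line"
print(find_consecutive_repeats(sample))  # [(1, 'Repeated row', 3)]
```

A flagged run is what would trigger the validator's correction pass, where the LLM is re-prompted to remove the duplicated content.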
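The four-layer configuration hierarchy (defaults → config file → environment variables → CLI arguments) amounts to a recursive dictionary merge in which later layers win. A minimal sketch of that behavior — illustrative only, not the library's actual loader:

```python
def merge_config(base, override):
    """Recursively merge `override` into `base`; values from later
    layers win, while nested sections are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"llm_provider": {"model": "gpt-4o-mini", "timeout": 60}}
file_cfg = {"llm_provider": {"model": "gpt-4o"}}   # from config/default.yaml
cli_cfg = {"llm_provider": {"timeout": 120}}       # from command-line flags

config = merge_config(merge_config(defaults, file_cfg), cli_cfg)
print(config)  # {'llm_provider': {'model': 'gpt-4o', 'timeout': 120}}
```

Note that the CLI layer overrides only `timeout`, while the file-supplied `model` survives — each layer replaces only the keys it actually sets.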
"bugtrack_url": null,
"license": null,
"summary": "Python library and CLI tool that leverages LLMs to convert technical PDF documents to well-structured Markdown",
"version": "0.2.0",
"project_urls": {
"Documentation": "https://github.com/juanqui/pdf2markdown#readme",
"Issues": "https://github.com/juanqui/pdf2markdown/issues",
"Source": "https://github.com/juanqui/pdf2markdown"
},
"split_keywords": [
"ai",
" conversion",
" document-conversion",
" document-processing",
" gpt",
" llm",
" machine-learning",
" markdown",
" ocr",
" openai",
" pdf",
" pdf-parser",
" pdf-to-text",
" transformers",
" vision"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5afd089860d197200c53e960d6568dba156654f13127b65f6327cd46eb4ff6a3",
"md5": "0aa9bfa87e94761563269663a0c7302f",
"sha256": "48fee172c1331d2828d442a6b67adbf8537e71fa9d7f8bbfa860d0ab25760737"
},
"downloads": -1,
"filename": "pdf2markdown-0.2.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0aa9bfa87e94761563269663a0c7302f",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 80875,
"upload_time": "2025-08-17T20:03:06",
"upload_time_iso_8601": "2025-08-17T20:03:06.932826Z",
"url": "https://files.pythonhosted.org/packages/5a/fd/089860d197200c53e960d6568dba156654f13127b65f6327cd46eb4ff6a3/pdf2markdown-0.2.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "391867c2189a8957ae328749122babc50f6ddaaed92fa0447140fe210b76b624",
"md5": "46e07fceba7ddfa0198833ca476e4280",
"sha256": "b9a9ac4eb162d0c40c8926c65cc8656ce6b2fbffb8ef88498b7ebcc370f73b11"
},
"downloads": -1,
"filename": "pdf2markdown-0.2.0.tar.gz",
"has_sig": false,
"md5_digest": "46e07fceba7ddfa0198833ca476e4280",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 117700,
"upload_time": "2025-08-17T20:03:08",
"upload_time_iso_8601": "2025-08-17T20:03:08.487206Z",
"url": "https://files.pythonhosted.org/packages/39/18/67c2189a8957ae328749122babc50f6ddaaed92fa0447140fe210b76b624/pdf2markdown-0.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-17 20:03:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "juanqui",
"github_project": "pdf2markdown#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pdf2markdown"
}