pdf2markdown 0.2.0

- **Summary:** Python library and CLI tool that leverages LLMs to convert technical PDF documents to well-structured Markdown
- **Uploaded:** 2025-08-17 20:03:08
- **Requires Python:** >=3.10
- **Author:** Juan Villa <juanqui@villafam.com>
- **Keywords:** ai, conversion, document-conversion, document-processing, gpt, llm, machine-learning, markdown, ocr, openai, pdf, pdf-parser, pdf-to-text, transformers, vision
# PDF to Markdown Converter

A Python application that leverages Large Language Models (LLMs) to accurately convert technical PDF documents into well-structured Markdown documents.

## Features

- 🚀 **High-Quality Conversion**: Uses state-of-the-art LLMs for accurate text extraction
- 📊 **Table Preservation**: Converts tables to HTML or Markdown format (configurable)
- 🔢 **Equation Support**: Preserves mathematical equations in LaTeX format
- 🖼️ **Image Handling**: Describes images and preserves captions
- ⚡ **Parallel Processing**: Processes multiple pages concurrently for speed
- 📈 **Progress Tracking**: Clear logging of processing status
- 🔧 **Configurable**: Extensive configuration options via YAML or CLI
- 🔄 **Retry Logic**: Automatic retry with exponential backoff for reliability
- ✅ **Validation Pipeline**: Extensible validation system with multiple validators
- 🔍 **Repetition Detection**: Automatically detects and corrects content repetition
- ✔️ **Markdown Validation**: Built-in syntax validation and correction using PyMarkdown
- 🎯 **Pure Output**: Generates only document content without additional commentary
- 🧹 **Smart Cleaning**: Automatically removes markdown code fences that LLMs sometimes add
- 📄 **Configurable Page Separators**: Customize how pages are separated in the output

## Installation

### From PyPI

```bash
pip install pdf2markdown
```

### Using Hatch (Development)

```bash
# Install Hatch
pipx install hatch

# Clone the repository
git clone https://github.com/juanqui/pdf2markdown.git
cd pdf2markdown

# Install dependencies
hatch env create

# Activate environment
hatch shell
```

### Using pip

```bash
# Clone the repository
git clone https://github.com/juanqui/pdf2markdown.git
cd pdf2markdown

# Install the package
pip install -e .

# Optional: Install with transformers support for local models
pip install -e ".[transformers]"
```

## Quick Start

1. **Set up configuration:**
```bash
# Copy the sample configuration file
cp config/default.sample.yaml config/default.yaml

# Edit the configuration file with your settings
# At minimum, update the llm_provider section with your API details
nano config/default.yaml  # or use your preferred editor
```

2. **Set your API key (recommended via environment variable):**
```bash
export OPENAI_API_KEY="your-api-key-here"
```

3. **Convert a PDF:**
```bash
pdf2markdown input.pdf -o output.md
```

## Library Usage

`pdf2markdown` can be used as a Python library in your own applications. This is useful for integrating PDF conversion into larger systems, web applications, or data processing pipelines.

### Simple Usage

```python
from pdf2markdown import PDFConverter

# Create converter with default settings
converter = PDFConverter()

# Convert a PDF to markdown
markdown_text = converter.convert_sync("document.pdf")
print(markdown_text)

# Save to a file
markdown_text = converter.convert_sync("document.pdf", "output.md")
```

### Configuration Options

```python
from pdf2markdown import PDFConverter, ConfigBuilder

# Build configuration programmatically
config = ConfigBuilder() \
    .with_openai(api_key="your-api-key", model="gpt-4o") \
    .with_resolution(400) \
    .with_page_workers(20) \
    .with_cache_dir("/tmp/my_cache") \
    .build()

converter = PDFConverter(config=config)
markdown = converter.convert_sync("document.pdf")
```

### Table Format Configuration

```python
from pdf2markdown import ConfigBuilder, PDFConverter

# Configure for HTML tables (better for complex layouts)
config = ConfigBuilder() \
    .with_openai(api_key="your-api-key") \
    .build()

# Set table format in the configuration
config['page_parser']['table_format'] = 'html'  # Default

converter = PDFConverter(config=config)

# Or configure for Markdown tables (simpler format)
config['page_parser']['table_format'] = 'markdown'
```

### Using Different LLM Providers

```python
from pdf2markdown import ConfigBuilder, PDFConverter

# OpenAI (or compatible endpoints)
config = ConfigBuilder() \
    .with_openai(
        api_key="your-key",
        model="gpt-4o-mini",
        endpoint="https://api.openai.com/v1"  # or your custom endpoint
    ) \
    .build()

# Local models with Transformers
config = ConfigBuilder() \
    .with_transformers(
        model_name="microsoft/Phi-3.5-vision-instruct",
        device="cuda",  # or "cpu"
        torch_dtype="float16"
    ) \
    .build()

converter = PDFConverter(config=config)
```

### Async Usage

```python
import asyncio
from pdf2markdown import PDFConverter

async def convert_pdf():
    converter = PDFConverter()
    
    # Async conversion
    markdown = await converter.convert("document.pdf")
    
    # With progress callback
    async def progress(current, total, message):
        print(f"Progress: {current}/{total} - {message}")
    
    markdown = await converter.convert(
        "document.pdf",
        progress_callback=progress
    )
    
    return markdown

# Run async function
markdown = asyncio.run(convert_pdf())
```

### Streaming Pages

Process large documents page by page as they complete:

```python
import asyncio
from pdf2markdown import PDFConverter

async def stream_conversion():
    converter = PDFConverter()
    
    async for page in converter.stream_pages("large_document.pdf"):
        print(f"Page {page.page_number}: {len(page.content)} characters")
        # Process each page as it completes
        # e.g., save to database, send to queue, etc.

asyncio.run(stream_conversion())
```

### Batch Processing

Convert multiple PDFs efficiently:

```python
import asyncio
from pdf2markdown import PDFConverter, ConversionStatus

async def batch_convert():
    converter = PDFConverter()
    
    pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
    results = await converter.process_batch(
        pdf_files,
        output_dir="./output"
    )
    
    for result in results:
        if result.status == ConversionStatus.COMPLETED:
            print(f"✓ {result.source_path}")
        else:
            print(f"✗ {result.source_path}: {result.error_message}")

asyncio.run(batch_convert())
```

### Loading Configuration from Files

```python
from pdf2markdown import PDFConverter, Config

# From YAML file
config = Config.from_yaml("config.yaml")
converter = PDFConverter(config=config)

# From dictionary
config_dict = {
    "llm_provider": {
        "provider_type": "openai",
        "api_key": "your-key",
        "model": "gpt-4o-mini"
    },
    "pipeline": {
        "page_workers": 15
    }
}
converter = PDFConverter(config=config_dict)
```

### Error Handling

```python
from pdf2markdown import (
    PDFConverter,
    PDFConversionError,
    ConfigurationError,
    ParsingError
)

try:
    converter = PDFConverter()
    markdown = converter.convert_sync("document.pdf")
except ConfigurationError as e:
    print(f"Configuration error: {e}")
except ParsingError as e:
    print(f"Failed to parse PDF: {e}")
    if e.page_number:
        print(f"Error on page {e.page_number}")
except PDFConversionError as e:
    print(f"Conversion failed: {e}")
```

### Context Manager

Properly clean up resources using context managers:

```python
import asyncio
from pdf2markdown import PDFConverter

async def convert_with_cleanup():
    async with PDFConverter() as converter:
        markdown = await converter.convert("document.pdf")
        # Converter automatically cleaned up after this block
    return markdown

markdown = asyncio.run(convert_with_cleanup())
```

### Integration Examples

#### Flask Web Application

```python
from flask import Flask, request, jsonify
from pdf2markdown import PDFConverter

app = Flask(__name__)
converter = PDFConverter()

@app.route('/convert', methods=['POST'])
def convert_pdf():
    if 'file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
    
    file = request.files['file']
    file.save('/tmp/upload.pdf')
    
    try:
        markdown = converter.convert_sync('/tmp/upload.pdf')
        return jsonify({'markdown': markdown})
    except Exception as e:
        return jsonify({'error': str(e)}), 500
```

#### Celery Task Queue

```python
from celery import Celery
from pdf2markdown import PDFConverter

app = Celery('tasks', broker='redis://localhost:6379')
converter = PDFConverter()

@app.task
def convert_pdf_task(pdf_path):
    """Background task to convert PDF"""
    return converter.convert_sync(pdf_path)
```

#### Document Processing Pipeline

```python
from pdf2markdown import PDFConverter, ConfigBuilder
import sqlite3

# Configure for high-quality conversion
config = ConfigBuilder() \
    .with_openai(api_key="your-key", model="gpt-4o") \
    .with_resolution(400) \
    .with_validators(['markdown', 'repetition']) \
    .build()

converter = PDFConverter(config=config)

def process_document(pdf_path, doc_id):
    """Process document and store in database"""
    # Convert PDF
    markdown = converter.convert_sync(pdf_path)
    
    # Store in database
    conn = sqlite3.connect('documents.db')
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO documents (id, content) VALUES (?, ?)",
        (doc_id, markdown)
    )
    conn.commit()
    conn.close()
    
    return doc_id
```

## CLI Usage

### Basic Usage

```bash
# Convert with default settings
pdf2markdown document.pdf

# Specify output file
pdf2markdown document.pdf -o converted.md

# Use a specific model
pdf2markdown document.pdf --model gpt-4o

# Adjust rendering resolution
pdf2markdown document.pdf --resolution 400
```

### Advanced Usage

```bash
# Use custom configuration file
pdf2markdown document.pdf --config my-config.yaml

# Parallel processing with more workers
pdf2markdown document.pdf --page-workers 20

# Disable progress logging for automation
pdf2markdown document.pdf --no-progress

# Save configuration for reuse
pdf2markdown document.pdf --save-config my-settings.yaml

# Specify table format (html or markdown)
pdf2markdown document.pdf --table-format html  # For complex tables
pdf2markdown document.pdf --table-format markdown  # For simple tables
```

### Configuration

#### Initial Setup

The application uses a YAML configuration file to manage settings. To get started:

1. **Copy the sample configuration:**
   ```bash
   cp config/default.sample.yaml config/default.yaml
   ```

2. **Review and edit the configuration:**
   The sample file (`config/default.sample.yaml`) is heavily documented with explanations for every setting. Key sections to configure:
   - `llm_provider`: Your LLM API settings (endpoint, API key, model)
   - `document_parser`: PDF rendering settings
   - `pipeline`: Worker and processing settings

3. **Set sensitive values via environment variables:**
   Instead of hardcoding API keys in the config file, use environment variables:
   ```bash
   export OPENAI_API_KEY="your-api-key-here"
   ```
   Then reference it in your config as: `${OPENAI_API_KEY}`
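The `${VAR}` reference style can be illustrated with a small sketch. This shows the substitution pattern itself, not the library's internal loader:

```python
import os
import re

def expand_env_vars(value: str) -> str:
    """Replace ${VAR} references in a config string with environment values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["OPENAI_API_KEY"] = "sk-test"  # placeholder value for illustration
print(expand_env_vars("${OPENAI_API_KEY}"))  # sk-test
```

Unset variables expand to an empty string in this sketch; the application may instead report a configuration error.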

#### Configuration File Structure

Here's an overview of the configuration structure:

```yaml
# LLM Provider Configuration (required)
llm_provider:
  provider_type: openai  # Provider type (currently supports "openai")
  endpoint: https://api.openai.com/v1  # API endpoint URL
  api_key: ${OPENAI_API_KEY}  # Can reference environment variables
  model: gpt-4o-mini  # Model to use
  max_tokens: 4096  # Maximum tokens in response
  temperature: 0.1  # Generation temperature (0.0-2.0)
  timeout: 60  # Request timeout in seconds
  
  # Penalty parameters to reduce repetition (all optional)
  presence_penalty: 0.0  # Penalize tokens based on presence (-2.0 to 2.0)
  frequency_penalty: 0.0  # Penalize tokens based on frequency (-2.0 to 2.0)
  repetition_penalty: null  # Alternative repetition penalty (0.0 to 2.0, some providers only)

# Document Parser Configuration
document_parser:
  type: simple  # Parser type
  resolution: 300  # DPI for rendering PDF pages to images
  cache_dir: /tmp/pdf2markdown/cache  # Cache directory for rendered images
  max_page_size: 50000000  # Maximum page size in bytes (50MB)
  timeout: 30  # Timeout for rendering operations

# Page Parser Configuration
page_parser:
  type: simple_llm  # Parser type
  prompt_template: null  # Optional custom prompt template path
  additional_instructions: null  # Optional additional LLM instructions
  
  # Table format configuration
  table_format: html  # 'html' for complex layouts, 'markdown' for simple tables
  
  # Content validation pipeline configuration
  validate_content: true  # Enable content validation
  
  validation:
    # List of validators to run (in order)
    validators: ["markdown", "repetition"]
    
    # Maximum number of correction attempts
    max_correction_attempts: 2
    
    # Markdown validator - checks syntax and formatting
    markdown:
      enabled: true  # Enable this validator
      attempt_correction: true  # Try to fix issues by re-prompting LLM
      strict_mode: false  # Use relaxed mode for LLM-generated content
      max_line_length: 1000  # Max line length (MD013 rule)
      disabled_rules: []  # Additional rules to disable
      enabled_rules: []  # Specific rules to enable
      # Note: Common overly-strict rules are disabled by default including:
      # MD033 (Inline HTML) - common in technical documents and tables
      # MD026 (Trailing punctuation in headings) - common in PDF headings
      # MD042 (No empty links) - LLMs may generate placeholder links during extraction
      # MD041, MD022, MD031, MD032, MD025, MD024, MD013, MD047, MD040
    
    # Repetition validator - detects and corrects unwanted repetition
    repetition:
      enabled: true  # Enable this validator
      attempt_correction: true  # Try to fix repetition issues
      consecutive_threshold: 3  # Flag 3+ consecutive duplicate lines
      window_size: 10  # Check within 10-line windows
      window_threshold: 3  # Flag 3+ occurrences in window
      check_exact_lines: true  # Check for exact duplicates
      check_normalized_lines: true  # Check ignoring whitespace/punctuation
      check_paragraphs: true  # Check for duplicate paragraphs
      check_patterns: true  # Detect repetitive patterns
      min_pattern_length: 20  # Minimum chars for pattern detection
      pattern_similarity_threshold: 0.9  # Similarity threshold (0-1)
      min_line_length: 5  # Minimum line length to check

# Pipeline Configuration
pipeline:
  document_workers: 1  # Must be 1 for sequential document processing
  page_workers: 10  # Number of parallel page processing workers
  queues:
    document_queue_size: 100
    page_queue_size: 1000
    output_queue_size: 500
  enable_progress: true  # Show progress bars
  log_level: INFO  # Logging level

# Output Configuration
output_dir: ./output  # Default output directory
temp_dir: /tmp/pdf2markdown  # Temporary file directory
page_separator: "\n\n--[PAGE: {page_number}]--\n\n"  # Separator between pages
```

#### Configuration Hierarchy

Configuration values are loaded in the following order (later values override earlier ones):

1. Default values in code
2. Configuration file (`config/default.yaml` or file specified via `--config`)
3. Environment variables
4. Command-line arguments
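The override order can be pictured as successive dictionary merges, where later sources win (a simplified sketch, not the application's actual loader):

```python
defaults = {"model": "gpt-4o-mini", "resolution": 300, "page_workers": 10}
config_file = {"resolution": 400}        # from config/default.yaml
env_overrides = {"model": "gpt-4o"}      # from environment variables
cli_overrides = {"page_workers": 20}     # from command-line arguments

# Later sources override earlier ones: defaults < file < environment < CLI
effective = {**defaults, **config_file, **env_overrides, **cli_overrides}
print(effective)  # {'model': 'gpt-4o', 'resolution': 400, 'page_workers': 20}
```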

**Note:** The application looks for `config/default.yaml` in the current working directory by default. You can specify a different configuration file using the `--config` option:
```bash
pdf2markdown input.pdf --config /path/to/my-config.yaml
```

#### LLM Provider Configuration

The `llm_provider` section is shared across all components that need LLM access. This centralized configuration makes it easy to:

- Switch between different LLM providers
- Use the same provider settings for multiple components
- Override settings globally via environment variables or CLI

**Supported Providers:**
- `openai`: Any OpenAI-compatible API (OpenAI, Azure OpenAI, local servers with OpenAI-compatible endpoints)
- `transformers`: Local models using HuggingFace Transformers (requires optional dependencies)

**Future Providers (planned):**
- `ollama`: Local models via Ollama
- `anthropic`: Anthropic Claude API
- `google`: Google Gemini API

##### Penalty Parameters for Reducing Repetition

To avoid repetitive text in the generated markdown, you can configure penalty parameters:

- **presence_penalty** (-2.0 to 2.0): Penalizes tokens that have already appeared in the text. Positive values discourage repetition.
- **frequency_penalty** (-2.0 to 2.0): Penalizes tokens based on their frequency in the text so far. Positive values reduce repetition of common phrases.
- **repetition_penalty** (0.0 to 2.0): Alternative parameter used by some providers (e.g., local models). Values > 1.0 reduce repetition.

**Recommended settings for reducing repetition:**
```yaml
llm_provider:
  presence_penalty: 0.5
  frequency_penalty: 0.5
  # OR for providers that use repetition_penalty:
  repetition_penalty: 1.15
```

#### Custom OpenAI-Compatible Endpoints

To use a custom OpenAI-compatible endpoint (e.g., local LLM server, vLLM, etc.):

```yaml
llm_provider:
  provider_type: openai
  endpoint: http://localhost:8080/v1  # Your custom endpoint
  api_key: dummy-key  # Some endpoints require a placeholder
  model: your-model-name
  max_tokens: 8192
  temperature: 0.7
  timeout: 120
```

#### Using Local Models with Transformers

The Transformers provider allows you to run models locally using HuggingFace Transformers. This is useful for:
- Running without API costs
- Processing sensitive documents locally
- Using specialized models not available via APIs
- Running on systems with GPU acceleration

**Installation:**
```bash
# Install with transformers support
pip install -e ".[transformers]"
```

**Configuration Example:**
```yaml
llm_provider:
  provider_type: transformers
  model_name: "openbmb/MiniCPM-V-4"  # HuggingFace model ID
  device: "auto"  # or "cuda", "cpu", "cuda:0", etc.
  torch_dtype: "bfloat16"  # or "float16", "float32", "auto"
  max_tokens: 4096
  temperature: 0.1
  do_sample: false
  
  # Optional: Use 4-bit quantization to save memory
  load_in_4bit: true
  
  # Optional: For models with .chat() method
  use_chat_method: true
```

**Supported Models (examples):**
- **MiniCPM-V series**: `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-2_6`
- **Nanonets OCR**: `nanonets/Nanonets-OCR-s`
- **Other vision models**: Any model supporting image-text-to-text generation

**Performance Tips:**
- Use `load_in_4bit: true` or `load_in_8bit: true` to reduce memory usage
- Set `page_workers: 1` in pipeline config for local models (they use more memory)
- Use `device_map: "auto"` for multi-GPU systems
- Consider using `attn_implementation: "flash_attention_2"` for faster inference (if supported)

See `config/transformers_example.yaml` for a complete configuration example.

## Environment Variables

### LLM Provider Variables
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `OPENAI_API_ENDPOINT`: Custom API endpoint URL (optional)
- `OPENAI_MODEL`: Model to use (default: gpt-4o-mini)

### Application Variables
- `PDF2MARKDOWN_CACHE_DIR`: Cache directory for rendered images
- `PDF2MARKDOWN_OUTPUT_DIR`: Default output directory
- `PDF2MARKDOWN_LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR)
- `PDF2MARKDOWN_TEMP_DIR`: Temporary file directory
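For example, to redirect caching and output and enable verbose logging without editing the config file (directory values here are placeholders):

```shell
export PDF2MARKDOWN_CACHE_DIR="$HOME/.cache/pdf2markdown"
export PDF2MARKDOWN_OUTPUT_DIR="./converted"
export PDF2MARKDOWN_LOG_LEVEL="DEBUG"
# then run the CLI as usual:
# pdf2markdown document.pdf
```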

## How It Works

1. **Document Parsing**: PDF pages are rendered as high-resolution images using PyMuPDF
2. **LLM Provider**: The configured LLM provider handles communication with the AI model
3. **Image Processing**: Each page image is sent to the LLM with vision capabilities
4. **Content Extraction**: The LLM extracts and formats content as Markdown
5. **Validation Pipeline**: Content passes through multiple validators:
   - **Markdown Validator**: Checks syntax and formatting
   - **Repetition Validator**: Detects unwanted repetition patterns
6. **Correction** (optional): If issues are found, the LLM is re-prompted with specific instructions to fix them
7. **Assembly**: Processed pages are combined into a single Markdown document
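The flow above can be sketched at a high level. Stub functions stand in for the real components; this is a conceptual outline of the steps, not the package's internals:

```python
PAGE_SEPARATOR = "\n\n--[PAGE: {page_number}]--\n\n"  # default separator

def render_pages(pdf_path):
    # stub: PyMuPDF renders each page to a high-resolution image
    return [f"<image of page {n}>" for n in range(1, 4)]

def llm_extract(image):
    # stub: the vision LLM returns Markdown for one page image
    return f"# Heading\n\nContent extracted from {image}"

def validate(markdown):
    # stub: markdown + repetition validators, with optional correction
    return markdown

def convert(pdf_path):
    pages = [validate(llm_extract(img)) for img in render_pages(pdf_path)]
    # Assembly: join the processed pages with the configured separator
    return "".join(
        (PAGE_SEPARATOR.format(page_number=i) if i > 1 else "") + page
        for i, page in enumerate(pages, start=1)
    )

doc = convert("document.pdf")
```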

### Architecture Overview

The application uses a modular architecture with these key components:

- **LLM Provider**: Abstraction layer for different LLM services (OpenAI, local models, etc.)
- **Document Parser**: Converts PDF pages to images
- **Page Parser**: Converts images to Markdown using LLM
- **Validation Pipeline**: Extensible system with multiple validators:
  - **Markdown Validator**: Validates and corrects syntax issues
  - **Repetition Validator**: Detects and corrects unwanted repetition
  - Easily extensible for additional validators
- **Pipeline**: Orchestrates the conversion process with parallel workers
- **Queue System**: Manages work distribution across workers
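To make "easily extensible" concrete, a custom validator might look like the sketch below. The class name and interface here are hypothetical, for illustration only; consult the package source for the actual extension API:

```python
class MaxHeadingDepthValidator:
    """Hypothetical validator: flags headings nested deeper than a limit."""

    def __init__(self, max_depth: int = 4):
        self.max_depth = max_depth

    def validate(self, content: str) -> list[str]:
        issues = []
        for lineno, line in enumerate(content.splitlines(), start=1):
            stripped = line.lstrip()
            if stripped.startswith("#"):
                # Heading depth = number of leading '#' characters
                depth = len(stripped) - len(stripped.lstrip("#"))
                if depth > self.max_depth:
                    issues.append(
                        f"line {lineno}: heading depth {depth} exceeds {self.max_depth}"
                    )
        return issues

v = MaxHeadingDepthValidator(max_depth=3)
issues = v.validate("# ok\n\n##### too deep")
```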

## Output Format

The converter preserves:
- **Headers**: Converted to appropriate Markdown heading levels
- **Tables**: Rendered as HTML tables or Markdown tables (configurable)
- **Lists**: Both ordered and unordered lists
- **Equations**: LaTeX format for mathematical expressions ($inline$ and $$display$$)
- **Images**: Descriptions or captions preserved
- **Formatting**: Bold, italic, code, and other text styling
- **Technical Elements**: Pin diagrams, electrical characteristics, timing specifications
- **Special Notations**: Notes, warnings, footnotes, and cross-references

### Table Format Options

The converter supports two table formats, configurable via the `table_format` setting:

#### HTML Tables (Default)
HTML tables are recommended for complex layouts with:
- Merged cells (colspan/rowspan)
- Nested tables
- Complex alignments
- Multi-line cell content

Example configuration:
```yaml
page_parser:
  table_format: html  # Default setting
```

Output example:
```html
<table>
  <thead>
    <tr>
      <th rowspan="2">Parameter</th>
      <th colspan="3">Conditions</th>
      <th>Unit</th>
    </tr>
    <tr>
      <th>Min</th>
      <th>Typ</th>
      <th>Max</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Operating Voltage</td>
      <td>1.7</td>
      <td>3.3</td>
      <td>3.6</td>
      <td>V</td>
    </tr>
  </tbody>
</table>
```

#### Markdown Tables
Markdown tables are simpler and more readable in plain text, best for:
- Simple tabular data
- Tables without merged cells
- Basic alignment needs

Example configuration:
```yaml
page_parser:
  table_format: markdown
```

Output example:
```markdown
| Parameter | Min | Typ | Max | Unit |
|-----------|----:|----:|----:|------|
| Voltage   | 1.7 | 3.3 | 3.6 | V    |
| Current   | 0.1 | 0.5 | 1.0 | mA   |
```

### Output Quality

The converter ensures high-quality output through multiple mechanisms:

#### Output Purity
- Outputs **ONLY** the content from the PDF document
- No explanatory text or comments
- No "Here is the content" preambles
- No additional formatting suggestions
- Automatically removes markdown code fences if LLM wraps output
- Just clean, accurate Markdown representing the original document

#### Validation Pipeline
- **Syntax Validation**: Ensures proper markdown formatting
- **Repetition Detection**: Identifies and corrects various types of repetition:
  - Consecutive duplicate lines
  - Near-duplicates within sliding windows
  - Duplicate paragraphs
  - Repetitive patterns
- **Extensible System**: Easy to add custom validators for specific needs
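The consecutive-duplicate check can be sketched as follows. This is a simplified illustration of the idea, mirroring the `consecutive_threshold: 3` and `min_line_length: 5` defaults, not the validator's actual code:

```python
def find_consecutive_repeats(text: str, threshold: int = 3, min_line_length: int = 5):
    """Return (line, count) pairs where a line repeats `threshold`+ times in a row."""
    repeats = []
    prev, count = None, 0
    for line in text.splitlines():
        if len(line.strip()) >= min_line_length and line == prev:
            count += 1
        else:
            if count >= threshold:
                repeats.append((prev, count))
            prev, count = line, 1
    if count >= threshold:  # flush a run that ends at EOF
        repeats.append((prev, count))
    return repeats

sample = "Intro\n" + "Repeated row\n" * 4 + "Outro"
print(find_consecutive_repeats(sample))  # [('Repeated row', 4)]
```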

### Page Separation

Pages are separated using a configurable separator (default: `--[PAGE: N]--`). You can customize this in the configuration:
```yaml
# Examples of page separators:
page_separator: "\n---\n"                           # Simple horizontal rule
page_separator: "\n\n<!-- Page {page_number} -->\n\n"  # HTML comment (invisible)
page_separator: "\n\n# Page {page_number}\n\n"         # Markdown heading
page_separator: "\n\n--[PAGE: {page_number}]--\n\n"    # Default format
```
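With the default separator, a converted document can be split back into per-page chunks in post-processing (a sketch using only the standard library):

```python
import re

markdown = (
    "Page one content"
    "\n\n--[PAGE: 2]--\n\n"
    "Page two content"
)

# The default separator embeds the page number, so split on its pattern
pages = re.split(r"\n\n--\[PAGE: \d+\]--\n\n", markdown)
print(pages)  # ['Page one content', 'Page two content']
```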

## Performance

- Processes pages in parallel (default: 10 workers)
- Automatic caching of rendered images
- Typical processing: 5-10 seconds per page

## Requirements

- Python 3.10+
- OpenAI API key (or compatible endpoint)
- System dependencies for PyMuPDF

## Configuration Examples

### Using Azure OpenAI

```yaml
llm_provider:
  provider_type: openai
  endpoint: https://your-resource.openai.azure.com/
  api_key: ${AZURE_OPENAI_KEY}
  model: gpt-4-vision
  max_tokens: 4096
```

### Using Local LLM Server

```yaml
llm_provider:
  provider_type: openai
  endpoint: http://localhost:11434/v1  # Ollama with OpenAI compatibility
  api_key: not-needed
  model: llava:13b
  max_tokens: 8192
  timeout: 300  # Longer timeout for local models
  # Many local servers use repetition_penalty instead
  repetition_penalty: 1.15
```

### High-Performance Configuration

```yaml
llm_provider:
  provider_type: openai
  endpoint: https://api.openai.com/v1
  api_key: ${OPENAI_API_KEY}
  model: gpt-4o
  max_tokens: 8192
  temperature: 0.1
  # Reduce repetition for better quality output
  presence_penalty: 0.5
  frequency_penalty: 0.5

pipeline:
  page_workers: 20  # More parallel workers for faster processing

document_parser:
  resolution: 400  # Higher quality images
```

## Troubleshooting

### API Key Issues
```bash
# Verify API key is set
echo $OPENAI_API_KEY

# Set in .env file
echo "OPENAI_API_KEY=your-key" > .env

# Check configuration
pdf2markdown document.pdf --save-config debug-config.yaml
# Then inspect debug-config.yaml
```

### Memory Issues
```bash
# Reduce worker count
pdf2markdown large.pdf --page-workers 5

# Lower resolution
pdf2markdown large.pdf --resolution 200
```

### Debugging
```bash
# Enable debug logging
pdf2markdown document.pdf --log-level DEBUG

# Check cache directory
ls /tmp/pdf2markdown/cache/
```

## Development

### Running Tests
```bash
hatch run test
```

### Code Formatting
```bash
hatch run format
```

### Type Checking
```bash
hatch run typecheck
```

## License

MIT License - see LICENSE file for details

            

default settings\npdf2markdown document.pdf\n\n# Specify output file\npdf2markdown document.pdf -o converted.md\n\n# Use a specific model\npdf2markdown document.pdf --model gpt-4o\n\n# Adjust rendering resolution\npdf2markdown document.pdf --resolution 400\n```\n\n### Advanced Usage\n\n```bash\n# Use custom configuration file\npdf2markdown document.pdf --config my-config.yaml\n\n# Parallel processing with more workers\npdf2markdown document.pdf --page-workers 20\n\n# Disable progress logging for automation\npdf2markdown document.pdf --no-progress\n\n# Save configuration for reuse\npdf2markdown document.pdf --save-config my-settings.yaml\n\n# Specify table format (html or markdown)\npdf2markdown document.pdf --table-format html  # For complex tables\npdf2markdown document.pdf --table-format markdown  # For simple tables\n```\n\n### Configuration\n\n#### Initial Setup\n\nThe application uses a YAML configuration file to manage settings. To get started:\n\n1. **Copy the sample configuration:**\n   ```bash\n   cp config/default.sample.yaml config/default.yaml\n   ```\n\n2. **Review and edit the configuration:**\n   The sample file (`config/default.sample.yaml`) is heavily documented with explanations for every setting. Key sections to configure:\n   - `llm_provider`: Your LLM API settings (endpoint, API key, model)\n   - `document_parser`: PDF rendering settings\n   - `pipeline`: Worker and processing settings\n\n3. 
**Set sensitive values via environment variables:**\n   Instead of hardcoding API keys in the config file, use environment variables:\n   ```bash\n   export OPENAI_API_KEY=\"your-api-key-here\"\n   ```\n   Then reference it in your config as: `${OPENAI_API_KEY}`\n\n#### Configuration File Structure\n\nHere's an overview of the configuration structure:\n\n```yaml\n# LLM Provider Configuration (required)\nllm_provider:\n  provider_type: openai  # Provider type (currently supports \"openai\")\n  endpoint: https://api.openai.com/v1  # API endpoint URL\n  api_key: ${OPENAI_API_KEY}  # Can reference environment variables\n  model: gpt-4o-mini  # Model to use\n  max_tokens: 4096  # Maximum tokens in response\n  temperature: 0.1  # Generation temperature (0.0-2.0)\n  timeout: 60  # Request timeout in seconds\n  \n  # Penalty parameters to reduce repetition (all optional)\n  presence_penalty: 0.0  # Penalize tokens based on presence (-2.0 to 2.0)\n  frequency_penalty: 0.0  # Penalize tokens based on frequency (-2.0 to 2.0)\n  repetition_penalty: null  # Alternative repetition penalty (0.0 to 2.0, some providers only)\n\n# Document Parser Configuration\ndocument_parser:\n  type: simple  # Parser type\n  resolution: 300  # DPI for rendering PDF pages to images\n  cache_dir: /tmp/pdf2markdown/cache  # Cache directory for rendered images\n  max_page_size: 50000000  # Maximum page size in bytes (50MB)\n  timeout: 30  # Timeout for rendering operations\n\n# Page Parser Configuration\npage_parser:\n  type: simple_llm  # Parser type\n  prompt_template: null  # Optional custom prompt template path\n  additional_instructions: null  # Optional additional LLM instructions\n  \n  # Table format configuration\n  table_format: html  # 'html' for complex layouts, 'markdown' for simple tables\n  \n  # Content validation pipeline configuration\n  validate_content: true  # Enable content validation\n  \n  validation:\n    # List of validators to run (in order)\n    validators: [\"markdown\", 
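`${VAR}` references like the one above are resolved from the environment when the configuration is loaded. As an illustration only (this is a hypothetical sketch, not the library's loader, and `expand_env_refs` is an invented helper), shell-style substitution can be done with a small regex:

```python
import os
import re

def expand_env_refs(value: str) -> str:
    """Replace ${VAR} references with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["OPENAI_API_KEY"] = "sk-demo"
print(expand_env_refs("api_key: ${OPENAI_API_KEY}"))  # -> api_key: sk-demo
```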
\"repetition\"]\n    \n    # Maximum number of correction attempts\n    max_correction_attempts: 2\n    \n    # Markdown validator - checks syntax and formatting\n    markdown:\n      enabled: true  # Enable this validator\n      attempt_correction: true  # Try to fix issues by re-prompting LLM\n      strict_mode: false  # Use relaxed mode for LLM-generated content\n      max_line_length: 1000  # Max line length (MD013 rule)\n      disabled_rules: []  # Additional rules to disable\n      enabled_rules: []  # Specific rules to enable\n      # Note: Common overly-strict rules are disabled by default including:\n      # MD033 (Inline HTML) - common in technical documents and tables\n      # MD026 (Trailing punctuation in headings) - common in PDF headings\n      # MD042 (No empty links) - LLMs may generate placeholder links during extraction\n      # MD041, MD022, MD031, MD032, MD025, MD024, MD013, MD047, MD040\n    \n    # Repetition validator - detects and corrects unwanted repetition\n    repetition:\n      enabled: true  # Enable this validator\n      attempt_correction: true  # Try to fix repetition issues\n      consecutive_threshold: 3  # Flag 3+ consecutive duplicate lines\n      window_size: 10  # Check within 10-line windows\n      window_threshold: 3  # Flag 3+ occurrences in window\n      check_exact_lines: true  # Check for exact duplicates\n      check_normalized_lines: true  # Check ignoring whitespace/punctuation\n      check_paragraphs: true  # Check for duplicate paragraphs\n      check_patterns: true  # Detect repetitive patterns\n      min_pattern_length: 20  # Minimum chars for pattern detection\n      pattern_similarity_threshold: 0.9  # Similarity threshold (0-1)\n      min_line_length: 5  # Minimum line length to check\n\n# Pipeline Configuration\npipeline:\n  document_workers: 1  # Must be 1 for sequential document processing\n  page_workers: 10  # Number of parallel page processing workers\n  queues:\n    document_queue_size: 100\n    
page_queue_size: 1000\n    output_queue_size: 500\n  enable_progress: true  # Show progress bars\n  log_level: INFO  # Logging level\n\n# Output Configuration\noutput_dir: ./output  # Default output directory\ntemp_dir: /tmp/pdf2markdown  # Temporary file directory\npage_separator: \"\\n\\n--[PAGE: {page_number}]--\\n\\n\"  # Separator between pages\n```\n\n#### Configuration Hierarchy\n\nConfiguration values are loaded in the following order (later values override earlier ones):\n\n1. Default values in code\n2. Configuration file (`config/default.yaml` or file specified via `--config`)\n3. Environment variables\n4. Command-line arguments\n\n**Note:** The application looks for `config/default.yaml` in the current working directory by default. You can specify a different configuration file using the `--config` option:\n```bash\npdf2markdown input.pdf --config /path/to/my-config.yaml\n```\n\n#### LLM Provider Configuration\n\nThe `llm_provider` section is shared across all components that need LLM access. This centralized configuration makes it easy to:\n\n- Switch between different LLM providers\n- Use the same provider settings for multiple components\n- Override settings globally via environment variables or CLI\n\n**Supported Providers:**\n- `openai`: Any OpenAI-compatible API (OpenAI, Azure OpenAI, local servers with OpenAI-compatible endpoints)\n- `transformers`: Local models using HuggingFace Transformers (requires optional dependencies)\n\n**Future Providers (planned):**\n- `ollama`: Local models via Ollama\n- `anthropic`: Anthropic Claude API\n- `google`: Google Gemini API\n\n##### Penalty Parameters for Reducing Repetition\n\nTo avoid repetitive text in the generated markdown, you can configure penalty parameters:\n\n- **presence_penalty** (-2.0 to 2.0): Penalizes tokens that have already appeared in the text. Positive values discourage repetition.\n- **frequency_penalty** (-2.0 to 2.0): Penalizes tokens based on their frequency in the text so far. 
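The precedence rules above amount to a recursive dictionary merge in which later sources win key by key. A minimal sketch of that behavior (illustrative only; `deep_merge` is not part of the library's API):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base; values from override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"llm_provider": {"model": "gpt-4o-mini", "timeout": 60}}
from_file = {"llm_provider": {"model": "gpt-4o"}}
from_cli = {"llm_provider": {"timeout": 120}}

# Apply in precedence order: defaults < config file < CLI arguments
config = deep_merge(deep_merge(defaults, from_file), from_cli)
print(config)  # {'llm_provider': {'model': 'gpt-4o', 'timeout': 120}}
```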
Positive values reduce repetition of common phrases.\n- **repetition_penalty** (0.0 to 2.0): Alternative parameter used by some providers (e.g., local models). Values > 1.0 reduce repetition.\n\n**Recommended settings for reducing repetition:**\n```yaml\nllm_provider:\n  presence_penalty: 0.5\n  frequency_penalty: 0.5\n  # OR for providers that use repetition_penalty:\n  repetition_penalty: 1.15\n```\n\n#### Custom OpenAI-Compatible Endpoints\n\nTo use a custom OpenAI-compatible endpoint (e.g., local LLM server, vLLM, etc.):\n\n```yaml\nllm_provider:\n  provider_type: openai\n  endpoint: http://localhost:8080/v1  # Your custom endpoint\n  api_key: dummy-key  # Some endpoints require a placeholder\n  model: your-model-name\n  max_tokens: 8192\n  temperature: 0.7\n  timeout: 120\n```\n\n#### Using Local Models with Transformers\n\nThe Transformers provider allows you to run models locally using HuggingFace Transformers. This is useful for:\n- Running without API costs\n- Processing sensitive documents locally\n- Using specialized models not available via APIs\n- Running on systems with GPU acceleration\n\n**Installation:**\n```bash\n# Install with transformers support\npip install -e \".[transformers]\"\n```\n\n**Configuration Example:**\n```yaml\nllm_provider:\n  provider_type: transformers\n  model_name: \"openbmb/MiniCPM-V-4\"  # HuggingFace model ID\n  device: \"auto\"  # or \"cuda\", \"cpu\", \"cuda:0\", etc.\n  torch_dtype: \"bfloat16\"  # or \"float16\", \"float32\", \"auto\"\n  max_tokens: 4096\n  temperature: 0.1\n  do_sample: false\n  \n  # Optional: Use 4-bit quantization to save memory\n  load_in_4bit: true\n  \n  # Optional: For models with .chat() method\n  use_chat_method: true\n```\n\n**Supported Models (examples):**\n- **MiniCPM-V series**: `openbmb/MiniCPM-V-4`, `openbmb/MiniCPM-V-2_6`\n- **Nanonets OCR**: `nanonets/Nanonets-OCR-s`\n- **Other vision models**: Any model supporting image-text-to-text generation\n\n**Performance Tips:**\n- Use 
`load_in_4bit: true` or `load_in_8bit: true` to reduce memory usage\n- Set `page_workers: 1` in pipeline config for local models (they use more memory)\n- Use `device_map: \"auto\"` for multi-GPU systems\n- Consider using `attn_implementation: \"flash_attention_2\"` for faster inference (if supported)\n\nSee `config/transformers_example.yaml` for a complete configuration example.\n\n## Environment Variables\n\n### LLM Provider Variables\n- `OPENAI_API_KEY`: Your OpenAI API key (required)\n- `OPENAI_API_ENDPOINT`: Custom API endpoint URL (optional)\n- `OPENAI_MODEL`: Model to use (default: gpt-4o-mini)\n\n### Application Variables\n- `PDF2MARKDOWN_CACHE_DIR`: Cache directory for rendered images\n- `PDF2MARKDOWN_OUTPUT_DIR`: Default output directory\n- `PDF2MARKDOWN_LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR)\n- `PDF2MARKDOWN_TEMP_DIR`: Temporary file directory\n\n## How It Works\n\n1. **Document Parsing**: PDF pages are rendered as high-resolution images using PyMuPDF\n2. **LLM Provider**: The configured LLM provider handles communication with the AI model\n3. **Image Processing**: Each page image is sent to the LLM with vision capabilities\n4. **Content Extraction**: The LLM extracts and formats content as Markdown\n5. **Validation Pipeline**: Content passes through multiple validators:\n   - **Markdown Validator**: Checks syntax and formatting\n   - **Repetition Validator**: Detects unwanted repetition patterns\n6. **Correction** (optional): If issues are found, the LLM is re-prompted with specific instructions to fix them\n7. 
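The `page_workers` setting caps how many pages are in flight at once. The idea can be sketched with an `asyncio` semaphore (a simplified illustration, not the actual pipeline; `process_page` is a stand-in for the render-then-LLM work):

```python
import asyncio

async def process_page(page_number: int) -> str:
    """Stand-in for rendering a page and sending it to the LLM."""
    await asyncio.sleep(0)  # placeholder for real I/O-bound work
    return f"content of page {page_number}"

async def convert(pages: int, page_workers: int = 10) -> list[str]:
    semaphore = asyncio.Semaphore(page_workers)

    async def bounded(n: int) -> str:
        async with semaphore:
            return await process_page(n)

    # Pages run concurrently, but gather() returns results in page order
    return await asyncio.gather(*(bounded(n) for n in range(1, pages + 1)))

results = asyncio.run(convert(3, page_workers=2))
print(results)
```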
**Assembly**: Processed pages are combined into a single Markdown document\n\n### Architecture Overview\n\nThe application uses a modular architecture with these key components:\n\n- **LLM Provider**: Abstraction layer for different LLM services (OpenAI, local models, etc.)\n- **Document Parser**: Converts PDF pages to images\n- **Page Parser**: Converts images to Markdown using LLM\n- **Validation Pipeline**: Extensible system with multiple validators:\n  - **Markdown Validator**: Validates and corrects syntax issues\n  - **Repetition Validator**: Detects and corrects unwanted repetition\n  - Easily extensible for additional validators\n- **Pipeline**: Orchestrates the conversion process with parallel workers\n- **Queue System**: Manages work distribution across workers\n\n## Output Format\n\nThe converter preserves:\n- **Headers**: Converted to appropriate Markdown heading levels\n- **Tables**: Rendered as HTML tables or Markdown tables (configurable)\n- **Lists**: Both ordered and unordered lists\n- **Equations**: LaTeX format for mathematical expressions ($inline$ and $$display$$)\n- **Images**: Descriptions or captions preserved\n- **Formatting**: Bold, italic, code, and other text styling\n- **Technical Elements**: Pin diagrams, electrical characteristics, timing specifications\n- **Special Notations**: Notes, warnings, footnotes, and cross-references\n\n### Table Format Options\n\nThe converter supports two table formats, configurable via the `table_format` setting:\n\n#### HTML Tables (Default)\nHTML tables are recommended for complex layouts with:\n- Merged cells (colspan/rowspan)\n- Nested tables\n- Complex alignments\n- Multi-line cell content\n\nExample configuration:\n```yaml\npage_parser:\n  table_format: html  # Default setting\n```\n\nOutput example:\n```html\n<table>\n  <thead>\n    <tr>\n      <th rowspan=\"2\">Parameter</th>\n      <th colspan=\"3\">Conditions</th>\n      <th>Unit</th>\n    </tr>\n    <tr>\n      <th>Min</th>\n      
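To make the repetition settings concrete: with `consecutive_threshold: 3`, runs of three or more identical lines are flagged. A minimal sketch of that check (illustrative; `find_consecutive_repeats` is not the validator's actual code):

```python
def find_consecutive_repeats(text: str, threshold: int = 3) -> list[tuple[str, int]]:
    """Return (line, count) for runs of identical non-empty lines >= threshold."""
    repeats = []
    run_line, run_count = None, 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped == run_line:
            run_count += 1
        else:
            if run_line and run_count >= threshold:
                repeats.append((run_line, run_count))
            run_line, run_count = stripped, 1
    if run_line and run_count >= threshold:
        repeats.append((run_line, run_count))
    return repeats

sample = "Value\nValue\nValue\nOther"
print(find_consecutive_repeats(sample))  # [('Value', 3)]
```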
<th>Typ</th>\n      <th>Max</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Operating Voltage</td>\n      <td>1.7</td>\n      <td>3.3</td>\n      <td>3.6</td>\n      <td>V</td>\n    </tr>\n  </tbody>\n</table>\n```\n\n#### Markdown Tables\nMarkdown tables are simpler and more readable in plain text, best for:\n- Simple tabular data\n- Tables without merged cells\n- Basic alignment needs\n\nExample configuration:\n```yaml\npage_parser:\n  table_format: markdown\n```\n\nOutput example:\n```markdown\n| Parameter | Min | Typ | Max | Unit |\n|-----------|----:|----:|----:|------|\n| Voltage   | 1.7 | 3.3 | 3.6 | V    |\n| Current   | 0.1 | 0.5 | 1.0 | mA   |\n```\n\n### Output Quality\n\nThe converter ensures high-quality output through multiple mechanisms:\n\n#### Output Purity\n- Outputs **ONLY** the content from the PDF document\n- No explanatory text or comments\n- No \"Here is the content\" preambles\n- No additional formatting suggestions\n- Automatically removes markdown code fences if LLM wraps output\n- Just clean, accurate Markdown representing the original document\n\n#### Validation Pipeline\n- **Syntax Validation**: Ensures proper markdown formatting\n- **Repetition Detection**: Identifies and corrects various types of repetition:\n  - Consecutive duplicate lines\n  - Near-duplicates within sliding windows\n  - Duplicate paragraphs\n  - Repetitive patterns\n- **Extensible System**: Easy to add custom validators for specific needs\n\n### Page Separation\n\nPages are separated using a configurable separator (default: `--[PAGE: N]--`). 
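The code-fence cleanup mentioned above handles the common case where the model wraps its entire answer in a ```` ```markdown ```` fence. One way such a wrapper can be detected and removed (a sketch under that assumption; `strip_wrapping_fence` is a hypothetical helper):

```python
def strip_wrapping_fence(text: str) -> str:
    """Remove a markdown code fence that wraps the entire output, if present."""
    lines = text.strip().splitlines()
    if len(lines) >= 2 and lines[0].startswith("```") and lines[-1].strip() == "```":
        return "\n".join(lines[1:-1])
    return text

wrapped = "```markdown\n# Title\nBody text.\n```"
print(strip_wrapping_fence(wrapped))  # -> "# Title\nBody text."
```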
You can customize this in the configuration:\n```yaml\n# Examples of page separators:\npage_separator: \"\\n---\\n\"                           # Simple horizontal rule\npage_separator: \"\\n\\n<!-- Page {page_number} -->\\n\\n\"  # HTML comment (invisible)\npage_separator: \"\\n\\n# Page {page_number}\\n\\n\"         # Markdown heading\npage_separator: \"\\n\\n--[PAGE: {page_number}]--\\n\\n\"    # Default format\n```\n\n## Performance\n\n- Processes pages in parallel (default: 10 workers)\n- Automatic caching of rendered images\n- Typical processing: 5-10 seconds per page\n\n## Requirements\n\n- Python 3.10+\n- OpenAI API key (or compatible endpoint)\n- System dependencies for PyMuPDF\n\n## Configuration Examples\n\n### Using Azure OpenAI\n\n```yaml\nllm_provider:\n  provider_type: openai\n  endpoint: https://your-resource.openai.azure.com/\n  api_key: ${AZURE_OPENAI_KEY}\n  model: gpt-4-vision\n  max_tokens: 4096\n```\n\n### Using Local LLM Server\n\n```yaml\nllm_provider:\n  provider_type: openai\n  endpoint: http://localhost:11434/v1  # Ollama with OpenAI compatibility\n  api_key: not-needed\n  model: llava:13b\n  max_tokens: 8192\n  timeout: 300  # Longer timeout for local models\n  # Many local servers use repetition_penalty instead\n  repetition_penalty: 1.15\n```\n\n### High-Performance Configuration\n\n```yaml\nllm_provider:\n  provider_type: openai\n  endpoint: https://api.openai.com/v1\n  api_key: ${OPENAI_API_KEY}\n  model: gpt-4o\n  max_tokens: 8192\n  temperature: 0.1\n  # Reduce repetition for better quality output\n  presence_penalty: 0.5\n  frequency_penalty: 0.5\n\npipeline:\n  page_workers: 20  # More parallel workers for faster processing\n\ndocument_parser:\n  resolution: 400  # Higher quality images\n```\n\n## Troubleshooting\n\n### API Key Issues\n```bash\n# Verify API key is set\necho $OPENAI_API_KEY\n\n# Set in .env file\necho \"OPENAI_API_KEY=your-key\" > .env\n\n# Check configuration\npdf2markdown document.pdf --save-config 
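Because `{page_number}` is a standard Python format field, a separator template can be filled with `str.format`. A sketch of how pages might be joined (illustrative only; the library's exact assembly logic may differ):

```python
separator_template = "\n\n--[PAGE: {page_number}]--\n\n"
pages = ["First page.", "Second page."]

# Join pages, placing a formatted separator before each page after the first
parts = [pages[0]]
for number, content in enumerate(pages[1:], start=2):
    parts.append(separator_template.format(page_number=number))
    parts.append(content)
document = "".join(parts)
print(document)
```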
debug-config.yaml\n# Then inspect debug-config.yaml\n```\n\n### Memory Issues\n```bash\n# Reduce worker count\npdf2markdown large.pdf --page-workers 5\n\n# Lower resolution\npdf2markdown large.pdf --resolution 200\n```\n\n### Debugging\n```bash\n# Enable debug logging\npdf2markdown document.pdf --log-level DEBUG\n\n# Check cache directory\nls /tmp/pdf2markdown/cache/\n```\n\n## Development\n\n### Running Tests\n```bash\nhatch run test\n```\n\n### Code Formatting\n```bash\nhatch run format\n```\n\n### Type Checking\n```bash\nhatch run typecheck\n```\n\n## License\n\nMIT License - see LICENSE file for details\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Python library and CLI tool that leverages LLMs to convert technical PDF documents to well-structured Markdown",
    "version": "0.2.0",
    "project_urls": {
        "Documentation": "https://github.com/juanqui/pdf2markdown#readme",
        "Issues": "https://github.com/juanqui/pdf2markdown/issues",
        "Source": "https://github.com/juanqui/pdf2markdown"
    },
    "split_keywords": [
        "ai",
        " conversion",
        " document-conversion",
        " document-processing",
        " gpt",
        " llm",
        " machine-learning",
        " markdown",
        " ocr",
        " openai",
        " pdf",
        " pdf-parser",
        " pdf-to-text",
        " transformers",
        " vision"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "5afd089860d197200c53e960d6568dba156654f13127b65f6327cd46eb4ff6a3",
                "md5": "0aa9bfa87e94761563269663a0c7302f",
                "sha256": "48fee172c1331d2828d442a6b67adbf8537e71fa9d7f8bbfa860d0ab25760737"
            },
            "downloads": -1,
            "filename": "pdf2markdown-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0aa9bfa87e94761563269663a0c7302f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 80875,
            "upload_time": "2025-08-17T20:03:06",
            "upload_time_iso_8601": "2025-08-17T20:03:06.932826Z",
            "url": "https://files.pythonhosted.org/packages/5a/fd/089860d197200c53e960d6568dba156654f13127b65f6327cd46eb4ff6a3/pdf2markdown-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "391867c2189a8957ae328749122babc50f6ddaaed92fa0447140fe210b76b624",
                "md5": "46e07fceba7ddfa0198833ca476e4280",
                "sha256": "b9a9ac4eb162d0c40c8926c65cc8656ce6b2fbffb8ef88498b7ebcc370f73b11"
            },
            "downloads": -1,
            "filename": "pdf2markdown-0.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "46e07fceba7ddfa0198833ca476e4280",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 117700,
            "upload_time": "2025-08-17T20:03:08",
            "upload_time_iso_8601": "2025-08-17T20:03:08.487206Z",
            "url": "https://files.pythonhosted.org/packages/39/18/67c2189a8957ae328749122babc50f6ddaaed92fa0447140fe210b76b624/pdf2markdown-0.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-17 20:03:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "juanqui",
    "github_project": "pdf2markdown#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pdf2markdown"
}
        
Elapsed time: 1.26625s