pdf-to-xls-vision


Name: pdf-to-xls-vision
Version: 1.0.4
Home page: https://github.com/yourusername/pdf-to-xls-vision
Summary: Convert PDF and image tables to Excel using Claude Vision API with automatic validation and multi-page merging
Upload time: 2025-10-29 21:15:56
Author: Your Name
Requires Python: >=3.7
License: MIT
Keywords: pdf, excel, conversion, vision-api, claude, anthropic, ocr, table-extraction
# PDF to XLS Vision

A Python library and command-line tool that converts tables in PDF and image files into Excel (XLSX) workbooks. Text-based PDFs are extracted directly; image-based pages go through the Claude Vision API with automatic rotation detection. Each table found in the PDF becomes a separate sheet in the output Excel file.

## Features

- **Automatic PDF type detection** - Intelligently detects text-based vs image-based PDFs
- **Rotation detection & correction** - Automatically detects and corrects rotated pages (90°, 180°, 270°)
- **Dual extraction modes:**
  - Text-based PDFs: Fast, direct extraction (free, no API needed)
  - Image-based PDFs: Claude Vision API with superior accuracy
- **Quality validation** - Automatically detects poor extraction quality and retries with Vision API
- **Multi-page table merging** - Automatically merges tables that span multiple pages into single continuous tables
- **Automatic data validation** - Compares extracted numbers with source PDF and generates detailed Markdown reports
- **Improved OCR accuracy** - 4x resolution rendering and enhanced Vision API prompts for better character recognition
- **Incremental saving** - Saves progress every 10 pages for large PDFs
- **Batch processing** - Process entire directories with recursive scanning
- **Python library & CLI** - Use as a library in your code or as a command-line tool
- **Image file support** - Process image files (.jpg, .jpeg, .png, .tiff, .tif) directly

## Requirements

- Python 3.7+
- Anthropic API key (for image-based PDFs)
- Tesseract OCR (for rotation detection; install commands are listed under Troubleshooting)

## Installation

### Install from PyPI (Recommended)

The easiest way to install:

```bash
pip install pdf-to-xls-vision
```

### Install from Source (for development)

```bash
# Clone the repository
git clone https://github.com/yourusername/pdf-to-xls-vision.git
cd pdf-to-xls-vision

# Install in development mode
pip install -e .
```

### Configuration

Set up your configuration:

1. Copy `.env.sample` to `.env`:
   ```bash
   cp .env.sample .env
   ```

2. Get your API key from: https://console.anthropic.com/

3. Edit the `.env` file and replace `your-api-key-here` with your actual API key:
   ```
   ANTHROPIC_API_KEY=sk-ant-your-actual-key-here
   ```

4. (Optional) Choose a different Claude model:
   ```
   CLAUDE_MODEL=claude-sonnet-4-5-20250929
   ```

   Available models:
   - `claude-sonnet-4-5-20250929` (default, most accurate)
   - `claude-3-5-sonnet-20241022` (fast, cost-effective)
   - `claude-3-5-sonnet-20240620` (balanced)
   - `claude-3-opus-20240229` (highest quality, slower)
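
A minimal sketch of how these two settings could be resolved, assuming they are read from environment variables with the names shown above once `.env` has been loaded. The `load_settings` helper and the `DEFAULT_MODEL` constant are hypothetical names used for illustration; they are not part of the package.

```python
import os

# Default taken from the model list above; the constant name is hypothetical.
DEFAULT_MODEL = "claude-sonnet-4-5-20250929"


def load_settings(env=None):
    """Return (api_key, model_name) from a mapping of environment variables.

    Passing a plain dict instead of os.environ makes the helper easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY") or None   # empty string counts as unset
    model_name = env.get("CLAUDE_MODEL") or DEFAULT_MODEL
    return api_key, model_name


if __name__ == "__main__":
    key, model = load_settings()
    print("API key set:", key is not None)
    print("Model:", model)
```

Text-based PDFs do not need a key, so a missing `ANTHROPIC_API_KEY` is only a problem once an image-based file is processed.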

## Usage

### As a Python Library

```python
from pdf_to_xls import convert_pdf_to_excel, batch_convert_directory

# Convert a single PDF
# Outputs: output.xlsx and output_validation.md
convert_pdf_to_excel('input.pdf', output_path='output.xlsx')

# Batch convert a directory
batch_convert_directory('pdfs/', output_dir='excel_files/', recursive=True)

# Force Vision API for complex tables
convert_pdf_to_excel('complex_table.pdf', force_vision=True)

# Convert image files directly
convert_pdf_to_excel('scanned_table.jpg', output_path='output.xlsx')

# Use custom API key and model
convert_pdf_to_excel(
    'input.pdf',
    api_key='your-api-key',
    model_name='claude-3-5-sonnet-20241022'
)
```

### Output Files

Each conversion generates up to two files:
- **{filename}.xlsx** - Excel file with extracted tables
- **{filename}_validation.md** - Markdown validation report (for text-based PDFs)
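
The naming pattern above can be sketched with `pathlib`. This assumes the validation report is written next to the workbook and that both default to the input file's directory; `expected_outputs` is a hypothetical helper, not a function exported by the package.

```python
from pathlib import Path


def expected_outputs(input_path, output_dir=None):
    """Return the (workbook, report) paths implied by the naming pattern."""
    src = Path(input_path)
    # Assumption: without an explicit output directory, files land beside the input.
    target_dir = Path(output_dir) if output_dir is not None else src.parent
    workbook = target_dir / f"{src.stem}.xlsx"
    report = target_dir / f"{src.stem}_validation.md"
    return workbook, report


if __name__ == "__main__":
    for path in expected_outputs("pdfs/OpStmts/1206.pdf"):
        print(path)
```

A script can use these paths to check whether a report exists before trying to open it, since image-based inputs produce only the workbook.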

See the [examples/](examples/) directory for more usage examples:
- [basic_usage.py](examples/basic_usage.py) - Simple conversion examples
- [batch_processing.py](examples/batch_processing.py) - Batch processing examples
- [advanced_usage.py](examples/advanced_usage.py) - Advanced features and error handling

### As a Command-Line Tool

After installation, you can use the `pdf-to-xls` command:

#### Convert a Single PDF File

```bash
pdf-to-xls input.pdf
```

Output will be saved as `input.xlsx` in the same directory.

#### Specify Output Path

```bash
pdf-to-xls input.pdf -o output.xlsx
```

#### Convert All PDFs in a Directory

```bash
pdf-to-xls /path/to/pdfs
```

#### Batch Convert with Recursive Scanning

```bash
pdf-to-xls /path/to/pdfs -r -o /path/to/output
```

#### Force Vision API

```bash
pdf-to-xls input.pdf --force-vision
```
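
For readers wrapping the tool in their own scripts, the documented options map onto a small `argparse` definition. This is an illustrative sketch that mirrors only the flags shown above (`-o`, `-r`, `--force-vision`); it is not the package's actual CLI source, and the `dest` names are assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the same options as the documented commands."""
    parser = argparse.ArgumentParser(prog="pdf-to-xls")
    parser.add_argument("input", help="PDF or image file, or a directory of PDFs")
    parser.add_argument("-o", dest="output", default=None,
                        help="output file (single input) or output directory (batch)")
    parser.add_argument("-r", dest="recursive", action="store_true",
                        help="scan subdirectories when the input is a directory")
    parser.add_argument("--force-vision", action="store_true",
                        help="use the Vision API even for text-based PDFs")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["/path/to/pdfs", "-r", "-o", "/path/to/output"])
    print(args.input, args.output, args.recursive, args.force_vision)
```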

### CLI Examples

Convert all PDFs in a directory, including subdirectories:
```bash
pdf-to-xls "pdfs/OpStmts" -r
```

Convert a single file:
```bash
pdf-to-xls "pdfs/OpStmts/1206.pdf"
```

## How It Works

1. **Detection Phase**: Analyzes the PDF to determine if it's text-based or image-based
2. **Text-based PDFs**: Uses fast, free pdfplumber extraction with quality validation
3. **Image-based PDFs**:
   - Converts each page to high-resolution image (4x zoom)
   - Detects rotation using Tesseract OSD
   - Corrects rotation if needed
   - Extracts tables using Claude Vision API with accuracy-focused prompts
   - Saves progress every 10 pages
4. **Quality Check**: If text extraction has quality issues, automatically retries with Vision API
5. **Multi-page Merging**: Automatically detects and merges tables spanning multiple pages
6. **Validation**: Compares extracted numbers with source PDF and generates detailed Markdown report
7. **Output**: Creates an Excel file with merged tables and validation report
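
The branching in steps 1 to 4 can be sketched as a small function. The extractors and the quality checker are passed in as callables so the example stays self-contained; all names here are hypothetical and the real control flow inside the package may differ.

```python
def choose_method(has_text, force_vision=False):
    """Pick the first extraction method to try for a document."""
    if force_vision or not has_text:
        return "vision"
    return "text"


def extract(has_text, text_extractor, vision_extractor, quality_checker,
            force_vision=False):
    """Try text extraction first when allowed; fall back to vision on quality issues."""
    if choose_method(has_text, force_vision) == "text":
        tables = text_extractor()
        issues = quality_checker(tables)
        if not issues:
            return "text", tables
        # Quality problems found: discard the text result and retry with vision.
    return "vision", vision_extractor()


if __name__ == "__main__":
    method, tables = extract(
        has_text=True,
        text_extractor=lambda: [["garbled"]],
        vision_extractor=lambda: [["clean"]],
        quality_checker=lambda t: ["too few columns"],
    )
    print(method, tables)
```

Keeping the free text path first means the paid API is only called when the document has no text layer, when the caller forces it, or when the quality check rejects the text result.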

## Rotation Detection

The converter automatically detects and corrects rotated pages:
- Supports 90°, 180°, 270° rotations
- Uses Tesseract OSD (Orientation and Script Detection)
- Only corrects when confidence > 1.0
- Logs each rotation correction

Example output:
```
Processing page 2/31 with Claude Vision...
  Detected rotation 270° (confidence: 5.3) - correcting
  ✓ Extracted table: 23 rows x 15 columns
```
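
A minimal sketch of the confidence rule, assuming the orientation report has the plain-text layout Tesseract typically prints in OSD mode (the same text `pytesseract.image_to_osd` returns). The helper names and the sample report are illustrative; the package's own parsing code is not shown in this README.

```python
import re

# Illustrative report in the layout Tesseract usually prints for OSD.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 5.30
Script: Latin
Script confidence: 2.00
"""


def parse_osd(osd_text):
    """Return (rotate_degrees, orientation_confidence), or None if fields are missing."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    confidence = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    if rotate is None or confidence is None:
        return None
    return int(rotate.group(1)), float(confidence.group(1))


def rotation_to_apply(osd_text, min_confidence=1.0):
    """Return the correction angle, or 0 when no confident rotation was detected."""
    parsed = parse_osd(osd_text)
    if parsed is None:
        return 0
    angle, confidence = parsed
    if angle in (90, 180, 270) and confidence > min_confidence:
        return angle
    return 0


if __name__ == "__main__":
    print(rotation_to_apply(SAMPLE_OSD))
```

The threshold of 1.0 comes from the list above; low-confidence detections are ignored so that an upright page is not rotated by mistake.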

## Large PDF Support

For PDFs with 30+ pages:
- Progress is saved incrementally every 10 pages
- If interrupted, partial results are preserved
- Visual progress indicators show completion status

Example:
```
Processing page 10/31...
💾 Saving progress... (10/31 pages processed)
✓ Progress saved: 10 tables
```
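
The checkpointing behaviour can be sketched as a loop that calls a save function every `save_every` pages and once more at the end. The `process_pages` function and its callback signatures are assumptions made for illustration, not the package's internals.

```python
def process_pages(pages, extract_page, save, save_every=10):
    """Extract each page and persist partial results at regular intervals."""
    tables = []
    total = len(pages)
    for index, page in enumerate(pages, start=1):
        table = extract_page(page)
        if table is not None:
            tables.append(table)
        # Intermediate checkpoint; the final save below covers the last page.
        if index % save_every == 0 and index < total:
            save(tables, index, total)
    save(tables, total, total)
    return tables


if __name__ == "__main__":
    def report(tables, done, total):
        print(f"Saving progress... ({done}/{total} pages processed)")

    process_pages(list(range(31)), extract_page=lambda p: [p], save=report)
```

If a run is interrupted after page 25 of 31, the most recent checkpoint still holds the tables from the first 20 pages.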

## Data Validation Report

For text-based PDFs, a validation report is automatically generated to help verify accuracy:

```markdown
# Data Validation Report

## Summary
| Metric | Count |
|--------|-------|
| Total numbers in PDF | 1,214 |
| Total numbers in tables | 1,382 |
| Matching numbers | 901 |
| **Accuracy** | **74.22%** |

## ⚠️ Numbers in PDF but Missing/Undercounted in Tables
| Number | PDF Count | Table Count | Difference |
|--------|-----------|-------------|------------|
|  6100.0 |         1 |           0 |          1 |
...
```

**What it tells you:**
- Overall accuracy percentage
- Numbers that may have been misread by OCR
- Numbers that appear a different number of times in the PDF than in the tables
- Which values to prioritize for manual review, so you can focus on the flagged numbers instead of re-checking every cell

**How to use:**
1. Check the accuracy percentage
2. Review flagged numbers in the Excel output
3. Cross-reference with source PDF
4. Correct any discrepancies
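
The counts in the sample report are consistent with comparing how often each number occurs in the PDF text and in the extracted tables (901 / 1,214 ≈ 74.22%). The sketch below reproduces that idea with `collections.Counter`; the regular expression, the treatment of parentheses as negatives, and the function names are assumptions, and the package's actual matching rules may differ.

```python
import re
from collections import Counter

# Matches plain, comma-grouped, decimal, and parenthesised numbers.
NUMBER_RE = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")


def extract_numbers(text):
    """Count each numeric value found in a block of text."""
    counts = Counter()
    for token in NUMBER_RE.findall(text):
        # Assumption: "(21,862)" is an accounting-style negative.
        negative = (token.startswith("(") and token.endswith(")")) or "-" in token
        cleaned = token.strip("()").replace(",", "").lstrip("-")
        value = float(cleaned)
        counts[-value if negative else value] += 1
    return counts


def compare_counts(pdf_counts, table_counts):
    """Return (accuracy_percent, missing) where missing maps number -> (pdf, table)."""
    matching = sum((pdf_counts & table_counts).values())  # multiset intersection
    total_pdf = sum(pdf_counts.values())
    accuracy = 100.0 * matching / total_pdf if total_pdf else 0.0
    missing = {
        number: (count, table_counts.get(number, 0))
        for number, count in pdf_counts.items()
        if count > table_counts.get(number, 0)
    }
    return accuracy, missing


if __name__ == "__main__":
    pdf = extract_numbers("Gross 458,963 452,477 Vacancy (21,862) Other 6100.0")
    tables = extract_numbers("458,963 452,477 (21,862)")
    print(compare_counts(pdf, tables))
```

A number listed as missing is not necessarily an extraction error: page numbers, dates, and footnote markers in the PDF also count toward the PDF total.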

## API Reference

### Main Functions

#### `convert_pdf_to_excel(pdf_path, output_path=None, output_dir=None, save_every=10, force_vision=False, api_key=None, model_name=None)`

Convert a single PDF or image file to Excel.

**Parameters:**
- `pdf_path` (str|Path): Path to PDF or image file (.pdf, .jpg, .jpeg, .png, .tiff, .tif)
- `output_path` (str|Path, optional): Output Excel file path
- `output_dir` (str|Path, optional): Output directory
- `save_every` (int): Save progress every N pages (default: 10)
- `force_vision` (bool): Force Vision API even for text PDFs (default: False)
- `api_key` (str, optional): Anthropic API key (uses env var if not provided)
- `model_name` (str, optional): Claude model name (uses env var if not provided)

**Returns:** Path to created Excel file, or None if no tables found

**Outputs:**
- `{filename}.xlsx` - Excel file with extracted tables
- `{filename}_validation.md` - Validation report (text-based PDFs only)

**Raises:**
- `FileNotFoundError`: If file does not exist
- `ValueError`: If API key is required but not found
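
A calling script usually wants to handle the three documented outcomes (a path, `None`, or an exception) in one place. The wrapper below is a hypothetical helper that takes the conversion function as an argument, so it runs without the package installed; the status strings are illustrative.

```python
def safe_convert(convert, path, **options):
    """Call a convert_pdf_to_excel-style function and return (status, detail)."""
    try:
        result = convert(path, **options)
    except FileNotFoundError:
        return "missing-input", None
    except ValueError as exc:
        # Documented case: an API key is required but was not found.
        return "invalid-config", str(exc)
    if result is None:
        return "no-tables", None
    return "ok", result


if __name__ == "__main__":
    # With the package installed, pass the real function instead of this stub:
    #   from pdf_to_xls import convert_pdf_to_excel
    #   safe_convert(convert_pdf_to_excel, "input.pdf", force_vision=True)
    def stub(path, **options):
        return path.replace(".pdf", ".xlsx")

    print(safe_convert(stub, "input.pdf"))
```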

#### `batch_convert_directory(input_dir, output_dir=None, recursive=False, force_vision=False, api_key=None, model_name=None)`

Batch convert PDFs in a directory.

**Parameters:**
- `input_dir` (str|Path): Directory containing PDF files
- `output_dir` (str|Path, optional): Output directory
- `recursive` (bool): Recursively search subdirectories (default: False)
- `force_vision` (bool): Force Vision API for all PDFs (default: False)
- `api_key` (str, optional): Anthropic API key
- `model_name` (str, optional): Claude model name

**Returns:** Dictionary with 'success' and 'failed' lists of file paths

**Raises:**
- `FileNotFoundError`: If input directory does not exist
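
The returned dictionary can be condensed into a short summary. This sketch relies only on the documented `'success'` and `'failed'` keys; `summarize_batch` is a hypothetical helper.

```python
def summarize_batch(results):
    """Summarize a batch result dictionary with 'success' and 'failed' lists."""
    succeeded = list(results.get("success", []))
    failed = list(results.get("failed", []))
    return {
        "total": len(succeeded) + len(failed),
        "succeeded": len(succeeded),
        "failed": len(failed),
        "retry": failed,  # candidates for a second pass, e.g. with force_vision=True
    }


if __name__ == "__main__":
    example = {"success": ["pdfs/a.pdf", "pdfs/b.pdf"], "failed": ["pdfs/c.pdf"]}
    print(summarize_batch(example))
```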

### Utility Functions

#### `pdf_is_image_based(pdf_path)`

Check if PDF is image-based (contains images).

**Parameters:**
- `pdf_path` (str|Path): Path to PDF file

**Returns:** bool - True if PDF is image-based

#### `pdf_has_text(pdf_path)`

Check if PDF has extractable text.

**Parameters:**
- `pdf_path` (str|Path): Path to PDF file

**Returns:** bool - True if PDF has extractable text

#### `detect_quality_issues(table_data)`

Detect quality issues in extracted table data.

**Parameters:**
- `table_data`: DataFrame or raw table data

**Returns:** list - List of quality issue descriptions

## Cost Information

- **Text-based PDFs**: Free (no API calls)
- **Image-based PDFs**: ~$0.01-0.05 per page with Claude Vision API
- The tool automatically chooses the most cost-effective method
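
As a rough budgeting aid, the per-page range above can be multiplied out. The figures are the approximate ones quoted in this section, not current API pricing, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(image_pages, low_per_page=0.01, high_per_page=0.05):
    """Return a (low, high) cost estimate in USD for image-based pages."""
    if image_pages < 0:
        raise ValueError("image_pages must be non-negative")
    low = round(image_pages * low_per_page, 2)
    high = round(image_pages * high_per_page, 2)
    return low, high


if __name__ == "__main__":
    low, high = estimate_cost(31)
    print(f"31 image-based pages: about ${low:.2f} to ${high:.2f}")
```

Pages handled by text extraction should be left out of `image_pages`, since they involve no API calls.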

## Troubleshooting

**PDF not converting properly?**
- The tool automatically detects and uses the best method
- Check that your `.env` file has a valid API key for image-based PDFs
- Make sure the PDF isn't password-protected
- Try `--force-vision` flag for complex table layouts

**Process taking too long?**
- Large image-based PDFs (30+ pages) may take 15-25 minutes
- Progress is saved every 10 pages
- Check for incremental save messages

**Rotation issues?**
- Rotation detection requires Tesseract OCR to be installed
- Install via: `brew install tesseract` (Mac) or `apt-get install tesseract-ocr` (Linux)

**Import errors?**
- Make sure you installed the package: `pip install pdf-to-xls-vision` (or `pip install -e .` from a source checkout)
- From a source checkout, check that all dependencies are installed: `pip install -r requirements.txt`

## Development

### Project Structure

```
pdf-to-xls-vision/
├── pdf_to_xls/              # Main package
│   ├── __init__.py         # Public API
│   ├── config.py           # Configuration management
│   ├── converter.py        # Main conversion functions
│   ├── data_cleaning.py    # Data cleaning utilities
│   ├── excel_writer.py     # Excel generation
│   ├── image_processing.py # Image conversion and rotation
│   ├── pdf_detection.py    # PDF type detection
│   ├── quality_check.py    # Quality validation
│   └── table_extraction.py # Table extraction (vision & text)
├── examples/               # Usage examples
│   ├── basic_usage.py
│   ├── batch_processing.py
│   └── advanced_usage.py
├── pdf_to_xls_cli.py      # CLI entry point
├── setup.py               # Package setup
├── pyproject.toml         # Modern Python packaging
├── requirements.txt       # Dependencies
├── README.md             # This file
└── LICENSE               # License file
```

### Running Tests

```bash
# Install development dependencies
pip install -e ".[dev]"

# Run tests (when test suite is added)
pytest
```

## Table Structure

Extracted tables use a simple, consistent structure:

| Row_Type | Category | 2020 | 2019 | ... |
|----------|----------|------|------|-----|
| HEADER | REVENUES | | | |
| DETAIL | Gross rental income | 458,963 | 452,477 | |
| DETAIL | Vacancy loss | (21,862) | (18,065) | |
| ROLLUP | Total revenues | 421,934 | 408,059 | |

**Row Types:**
- `HEADER` - Section/category headers
- `DETAIL` - Individual line items
- `ROLLUP` - Total/summary rows
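
Downstream code can use `Row_Type` to separate line items from headers and totals. The sketch below works on plain dictionaries shaped like the sample table; it assumes cell values arrive as strings and that parentheses mark negative amounts (the usual accounting convention). Both helper names are hypothetical.

```python
ROWS = [
    {"Row_Type": "HEADER", "Category": "REVENUES", "2020": "", "2019": ""},
    {"Row_Type": "DETAIL", "Category": "Gross rental income", "2020": "458,963", "2019": "452,477"},
    {"Row_Type": "DETAIL", "Category": "Vacancy loss", "2020": "(21,862)", "2019": "(18,065)"},
    {"Row_Type": "ROLLUP", "Category": "Total revenues", "2020": "421,934", "2019": "408,059"},
]


def parse_amount(cell):
    """Convert '458,963' or '(21,862)' to a float; return None for empty cells."""
    text = cell.strip()
    if not text:
        return None
    negative = text.startswith("(") and text.endswith(")")
    value = float(text.strip("()").replace(",", ""))
    return -value if negative else value


def detail_amounts(rows, column):
    """Map each DETAIL category to its numeric amount in the given column."""
    return {
        row["Category"]: parse_amount(row[column])
        for row in rows
        if row["Row_Type"] == "DETAIL"
    }


if __name__ == "__main__":
    print(detail_amounts(ROWS, "2020"))
```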

**Multi-page Tables:**
Tables that span multiple pages are automatically detected and merged into a single continuous table.
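
One simple way to picture the merge is to append a page's rows to the previous table whenever the column headers are identical. This is a deliberately naive sketch under that assumption; the README does not describe the detection rule the package actually uses, and `merge_continuations` is a hypothetical name.

```python
def merge_continuations(tables):
    """Merge consecutive tables that share the same header into one table."""
    merged = []
    for table in tables:
        if merged and merged[-1]["header"] == table["header"]:
            # Same columns as the previous table: treat it as a continuation.
            merged[-1]["rows"].extend(table["rows"])
        else:
            # Copy the lists so the caller's input is left untouched.
            merged.append({"header": list(table["header"]), "rows": list(table["rows"])})
    return merged


if __name__ == "__main__":
    pages = [
        {"header": ["Row_Type", "Category", "2020"], "rows": [["DETAIL", "Rent", "10"]]},
        {"header": ["Row_Type", "Category", "2020"], "rows": [["ROLLUP", "Total", "10"]]},
        {"header": ["Row_Type", "Category", "2019"], "rows": [["DETAIL", "Rent", "9"]]},
    ]
    for table in merge_continuations(pages):
        print(table["header"], len(table["rows"]))
```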

## Technical Details

- Uses `pdfplumber` for text extraction
- Uses `pytesseract` for rotation detection
- Uses Claude Vision API (Sonnet 4.5) for image-based extraction
- Uses `openpyxl` for Excel file generation
- 4x resolution rendering (for example, 3368x2380 pixels for a landscape A4 page) for optimal OCR accuracy
- Automatic quality validation and retry logic
- Automatic multi-page table continuation detection and merging
- Post-extraction number validation and discrepancy reporting
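
The pixel size quoted above follows from the page size in PDF points (1/72 inch) multiplied by the zoom factor. A short sketch of that arithmetic, using rounded landscape A4 dimensions of 842 x 595 points; the function names are hypothetical and no particular rendering library is assumed.

```python
def rendered_size(width_pt, height_pt, zoom=4):
    """Return the (width, height) in pixels of a page rendered at the given zoom."""
    return round(width_pt * zoom), round(height_pt * zoom)


def effective_dpi(zoom=4):
    """PDF points are 1/72 inch, so a zoom factor of 1 corresponds to 72 DPI."""
    return 72 * zoom


if __name__ == "__main__":
    print(rendered_size(842, 595))  # landscape A4, rounded to whole points
    print(effective_dpi())
```

Larger source pages produce proportionally larger images, which is presumably why version 1.0.3 needed a fix for the API's image size limit.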

## License

MIT License - see [LICENSE](LICENSE) file for details

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Accuracy and Limitations

**Expected Accuracy:**
- Text-based PDFs with simple tables: ~95-99%
- Image-based PDFs with complex tables: ~85-95%
- Wide tables (12+ columns) with small text: ~70-90%

**Known Limitations:**
1. **OCR Errors**: Vision API may misread similar characters (6 vs 8, O vs 0)
2. **Complex Layouts**: Tables with merged cells or irregular structures may not extract perfectly
3. **Image Quality**: Low-resolution source PDFs reduce accuracy
4. **Text-only Validation**: Validation reports only work for text-based PDFs

**Best Practices:**
- ✅ Always review the validation report
- ✅ Manually verify critical numbers (especially financial data)
- ✅ Use high-quality source PDFs when possible
- ✅ For mission-critical accuracy, consider human verification of flagged numbers

## Changelog

### Version 1.0.4
- **Multi-page table merging** - Automatically detects and merges continuation tables
- **Data validation reports** - Generates Markdown reports comparing PDF vs extracted numbers
- **Improved OCR accuracy** - 4x resolution rendering, enhanced Vision API prompts
- **Single Category column** - Simplified table structure for easier downstream processing
- **Generic header detection** - Supports both "Col1" and "Column1" header patterns
- **Debug logging** - Added image size tracking for troubleshooting
- **Image file support** - Process .jpg, .jpeg, .png, .tiff, .tif files directly

### Version 1.0.3
- Fix image size limit error for Claude API

### Version 1.0.2
- Add support for image file inputs

### Version 1.0.1
- Bug fixes and improvements

### Version 1.0.0
- Initial release with library structure
- Modular package design
- Python library API
- Command-line interface
- Automatic PDF type detection
- Vision API with rotation correction
- Quality validation and auto-retry
- Batch processing support
- Comprehensive examples

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/yourusername/pdf-to-xls-vision",
    "name": "pdf-to-xls-vision",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": null,
    "keywords": "pdf, excel, conversion, vision-api, claude, anthropic, ocr, table-extraction",
    "author": "Your Name",
    "author_email": "Your Name <your.email@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/7f/f9/b984166b1138223300f5c3aa8569543f2d3da1d5489a299630759f353b86/pdf_to_xls_vision-1.0.4.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Convert PDF and image tables to Excel using Claude Vision API with automatic validation and multi-page merging",
    "version": "1.0.4",
    "project_urls": {
        "Bug Tracker": "https://github.com/yourusername/pdf-to-xls-vision/issues",
        "Documentation": "https://github.com/yourusername/pdf-to-xls-vision#readme",
        "Homepage": "https://github.com/yourusername/pdf-to-xls-vision",
        "Repository": "https://github.com/yourusername/pdf-to-xls-vision"
    },
    "split_keywords": [
        "pdf",
        " excel",
        " conversion",
        " vision-api",
        " claude",
        " anthropic",
        " ocr",
        " table-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a1f171231216a0158bc8f62d805f140053b8023a33f942796b5489bc9a9e94da",
                "md5": "084cd7eebcb04089885b0d66b7dacf1b",
                "sha256": "4dadbadb04973be54064ce4421988863614bd01d46b4e651233baa9943a6cb70"
            },
            "downloads": -1,
            "filename": "pdf_to_xls_vision-1.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "084cd7eebcb04089885b0d66b7dacf1b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 31005,
            "upload_time": "2025-10-29T21:15:55",
            "upload_time_iso_8601": "2025-10-29T21:15:55.652979Z",
            "url": "https://files.pythonhosted.org/packages/a1/f1/71231216a0158bc8f62d805f140053b8023a33f942796b5489bc9a9e94da/pdf_to_xls_vision-1.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7ff9b984166b1138223300f5c3aa8569543f2d3da1d5489a299630759f353b86",
                "md5": "6f4ceed6d9302d55bef79807d4c86cd3",
                "sha256": "e9971a12393d8a7a78dd4f4108607161d662be989770423e9531f0ed31da8ce1"
            },
            "downloads": -1,
            "filename": "pdf_to_xls_vision-1.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "6f4ceed6d9302d55bef79807d4c86cd3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 36713,
            "upload_time": "2025-10-29T21:15:56",
            "upload_time_iso_8601": "2025-10-29T21:15:56.884636Z",
            "url": "https://files.pythonhosted.org/packages/7f/f9/b984166b1138223300f5c3aa8569543f2d3da1d5489a299630759f353b86/pdf_to_xls_vision-1.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-29 21:15:56",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yourusername",
    "github_project": "pdf-to-xls-vision",
    "github_not_found": true,
    "lcname": "pdf-to-xls-vision"
}
        