tabuparse 0.1.0

- Summary: A Python CLI tool for extracting, normalizing, and merging tabular data from PDF documents
- Author: Daniel Dias <daniel@lupeke.dev>
- Homepage: https://github.com/lupeke/tabuparse
- Requires Python: >=3.11
- License: MIT
- Keywords: cli, data engineering, extraction, parsing, pdf, table, utility
- Uploaded: 2025-07-11 16:50:57
            <div align="center">
    <img src="https://storage.googleapis.com/lupeke.dev/_tabuparse.png" alt="tabuparse" width="250" /><br />
    <p><b>extract, transform and export PDF tabular data</b></p>
    <p>
        <img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+" />
        <img src="https://img.shields.io/badge/asyncio-ready-blueviolet" alt="asyncio support" />
        <img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT" />
    </p>
</div>

## About
```tabuparse``` is a Python CLI tool and library for extracting, normalizing, and merging tabular data from PDF documents.

## Installation

> [!WARNING]
> This project is still in alpha mode and might go sideways.


### From source

```bash
git clone https://github.com/lupeke/tabuparse.git && \
cd tabuparse && \
python3 -m venv .venv && source .venv/bin/activate && \
pip install -e .
```

#### Run a health check
```bash
python tests/check_install.py
```

## Quick start

### CLI usage

```bash
# Process single PDF with default settings
tabuparse process example.pdf

# Process multiple PDFs with configuration
tabuparse process *.pdf --config settings.toml --output data.csv

# Export to SQLite with summary statistics
tabuparse process documents/*.pdf --format sqlite --summary

# Preview processing without extraction
tabuparse preview *.pdf --config settings.toml

# Extract from single PDF for testing
tabuparse extract document.pdf --pages "1-3" --flavor stream
```
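After a SQLite export, the data can be inspected with Python's standard-library `sqlite3` module. A minimal sketch — the in-memory database and the table name `extracted_data` are stand-ins for illustration; check the actual export (e.g. with `.tables` in the `sqlite3` shell) to find the table name tabuparse used:

```python
import sqlite3

# Stand-in for the exported .sqlite file; the table name "extracted_data"
# is an assumption for this sketch, not a documented default.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extracted_data (invoice_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO extracted_data VALUES (?, ?)",
    [("INV-001", 120.0), ("INV-002", 80.5)],
)

# Aggregate query over the extracted rows.
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM extracted_data"
).fetchone()
print(count, total)  # 2 200.5
conn.close()
```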

### Library usage

```python
import asyncio
from tabuparse import process_pdfs

async def main():
    # Process PDFs and get merged DataFrame
    result_df = await process_pdfs(
        pdf_paths=['invoice1.pdf', 'invoice2.pdf'],
        config_path='schema.toml',
        output_format='csv'
    )

    print(f"Extracted {len(result_df)} rows")
    print(result_df.head())

asyncio.run(main())
```
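Since `process_pdfs` returns a pandas DataFrame, follow-up cleaning is plain pandas. A sketch on a stand-in frame (the column names and values here are invented for illustration): PDF extraction often yields duplicate rows and string-typed numbers, so deduplicate and coerce dtypes before analysis.

```python
import pandas as pd

# Stand-in for the DataFrame returned by process_pdfs (illustration only).
result_df = pd.DataFrame({
    "Invoice ID": ["INV-1", "INV-1", "INV-2"],
    "Amount": ["10.0", "10.0", "5.5"],
})

# Drop duplicate rows, then coerce the string column to a numeric dtype;
# errors="coerce" turns unparseable cells into NaN instead of raising.
cleaned = result_df.drop_duplicates().reset_index(drop=True)
cleaned["Amount"] = pd.to_numeric(cleaned["Amount"], errors="coerce")

print(len(cleaned))             # 2
print(cleaned["Amount"].sum())  # 15.5
```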

## Configuration

```tabuparse``` uses TOML configuration files to define extraction parameters and expected schemas.

### generate sample configuration

```bash
tabuparse init-config settings.toml --columns "Invoice ID,Date,Amount,Description"
```

### configuration structure

```toml
# settings.toml
[table_structure]
expected_columns = [
    "Invoice ID",
    "Date",
    "Item Description",
    "Quantity",
    "Unit Price",
    "Total Amount"
]

[settings]
output_format = "csv"
strict_schema = false

[default_extraction]
flavor = "lattice"
pages = "all"

# PDF-specific extraction parameters
[[extraction_parameters]]
pdf_path = "invoice_batch_1.pdf"
pages = "1-5"
flavor = "lattice"

[[extraction_parameters]]
pdf_path = "statements.pdf"
pages = "all"
flavor = "stream"
table_areas = ["72,72,432,648"]  # left,bottom,right,top in points
```
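Since `table_areas` values are given in PDF points (1/72 inch; a US Letter page is 612 x 792 points), measurements taken in inches need converting. A hypothetical helper — not part of tabuparse — that builds an area string in the `left,bottom,right,top` order used above:

```python
# Hypothetical helper (illustration only, not a tabuparse API): convert an
# area measured in inches to a "left,bottom,right,top" string in PDF points.
def area_in_points(left_in, bottom_in, right_in, top_in):
    return ",".join(str(round(v * 72)) for v in (left_in, bottom_in, right_in, top_in))

# A region from 1" above the page bottom-left corner up to (6", 9"):
print(area_in_points(1, 1, 6, 9))  # 72,72,432,648
```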

### Configuration Options

#### table structure
- `expected_columns`: List of column names for schema normalization

#### settings
- `output_format`: "csv" or "sqlite"
- `strict_schema`: Enable strict schema validation (fail on mismatches)

#### extraction parameters
- `pages`: Page selection ("all", "1", "1,3,5", "1-3")
- `flavor`: Camelot extraction method ("lattice" or "stream")
- `table_areas`: Specific table regions to extract
- `pdf_path`: Apply parameters to specific PDF files
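To make the `pages` syntax concrete, here is a small parser that expands the selection strings listed above into page numbers. This is an illustrative sketch, not tabuparse's internal logic (the strings are ultimately handled by Camelot):

```python
# Illustrative parser for page-selection strings like "all", "1",
# "1,3,5", or "1-3" (not tabuparse internals).
def expand_pages(spec):
    if spec == "all":
        return "all"
    pages = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages

print(expand_pages("1-3"))    # [1, 2, 3]
print(expand_pages("1,3,5"))  # [1, 3, 5]
```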

## CLI Commands

### `process`
Extract and merge tables from multiple PDF files.

```bash
tabuparse process file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       TOML configuration file
  -o, --output PATH       Output file path
  --format [csv|sqlite]   Output format (default: csv)
  --max-concurrent INT    Max concurrent extractions (default: 5)
  --summary               Export summary statistics
  --no-clean              Disable data cleaning
  --strict                Enable strict schema validation
```

### `extract`
Extract tables from a single PDF (for testing).

```bash
tabuparse extract document.pdf [OPTIONS]

Options:
  -c, --config PATH              Configuration file
  --pages TEXT                   Pages to extract
  --flavor [lattice|stream]      Extraction method
  --show-info                    Show detailed table information
```

### `preview`
Preview processing statistics without extraction.

```bash
tabuparse preview file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       Configuration file
```

### `init-config`
Generate a sample configuration file.

```bash
tabuparse init-config config.toml [OPTIONS]

Options:
  --columns TEXT                 Expected column names (comma-separated)
  --format [csv|sqlite]          Default output format
  --flavor [lattice|stream]      Default extraction flavor
```

### `validate`
Validate PDF file compatibility.

```bash
tabuparse validate document.pdf
```

## Library API

### core functions

```python
from tabuparse import process_pdfs, extract_from_single_pdf

# Process multiple PDFs
result_df = await process_pdfs(
    pdf_paths=['file1.pdf', 'file2.pdf'],
    config_path='settings.toml',
    output_path='output.csv',
    output_format='csv',
    max_concurrent=5
)

# Extract from single PDF
tables = await extract_from_single_pdf(
    'document.pdf',
    config_path='settings.toml'
)
```

### configuration management

```python
from tabuparse.config_parser import parse_config, TabuparseConfig

# Load configuration
config = parse_config('settings.toml')

# Create programmatic configuration
config = TabuparseConfig(
    expected_columns=['ID', 'Name', 'Amount'],
    output_format='sqlite'
)
```

### data processing

```python
from tabuparse.data_processor import normalize_schema, merge_dataframes

# Normalize DataFrame schema
normalized_df = normalize_schema(
    df,
    expected_columns=['ID', 'Name', 'Amount'],
    strict_mode=False
)

# Merge multiple DataFrames
merged_df = merge_dataframes([df1, df2, df3])
```
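Conceptually, non-strict normalization maps whatever columns were extracted onto the expected list: missing columns are added empty, unexpected ones are dropped. A plain-pandas sketch of that idea (illustration only, not the library's implementation):

```python
import pandas as pd

# Extracted table with one expected column and one stray column.
df = pd.DataFrame({"ID": [1, 2], "Extra": ["x", "y"]})
expected = ["ID", "Name", "Amount"]

# reindex(columns=...) keeps expected columns, fills missing ones with NaN,
# and silently drops anything not in the list -- the non-strict behavior.
normalized = df.reindex(columns=expected)

print(list(normalized.columns))  # ['ID', 'Name', 'Amount']
```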

## Examples

### basic invoice processing

```bash
# Process invoice PDFs with predefined schema
tabuparse process invoices/*.pdf --config invoice_schema.toml --output invoices.csv
```

### multi-format export

```python
import asyncio
from tabuparse import process_pdfs

async def process_financial_data():
    # Extract data
    df = await process_pdfs(
        pdf_paths=['q1_report.pdf', 'q2_report.pdf'],
        config_path='financial_schema.toml'
    )

    # Export to multiple formats
    df.to_csv('financial_data.csv', index=False)
    df.to_excel('financial_data.xlsx', index=False)

    return df

asyncio.run(process_financial_data())
```

### custom processing pipeline

```python
from tabuparse.pdf_extractor import extract_tables_from_pdf
from tabuparse.data_processor import normalize_schema
from tabuparse.output_writer import write_sqlite

async def custom_pipeline():
    # Extract tables
    tables = await extract_tables_from_pdf('document.pdf')

    # Process each table
    processed_tables = []
    for table in tables:
        normalized = normalize_schema(
            table,
            expected_columns=['ID', 'Date', 'Amount']
        )
        processed_tables.append(normalized)

    # Merge and export
    import pandas as pd
    merged = pd.concat(processed_tables, ignore_index=True)
    write_sqlite(merged, 'output.sqlite', table_name='extracted_data')

asyncio.run(custom_pipeline())
```



<br /><hr />
<a href="https://www.flaticon.com/free-icons/samplings" title="samplings icons">Samplings icons by Afian Rochmah Afif - Flaticon</a>

            
