<div align="center">
<img src="https://storage.googleapis.com/lupeke.dev/_tabuparse.png" alt="tabuparse" width="250" /><br />
<p><b>extract, transform and export PDF tabular data</b></p>
<p>
<img src="https://img.shields.io/badge/python-3.10+-blue.svg" alt="Python 3.10+" />
<img src="https://img.shields.io/badge/asyncio-ready-blueviolet" alt="asyncio support" />
<img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT" />
</p>
</div>
## About
`tabuparse` is a Python CLI tool and library for extracting, normalizing, and merging tabular data from PDF documents.
## Installation
> [!WARNING]
> This project is still in alpha; expect rough edges and breaking changes.
### From source
```bash
git clone https://github.com/lupeke/tabuparse.git && \
cd tabuparse && \
python3 -m venv .venv && source .venv/bin/activate && \
pip install -e .
```
#### Run a health check
```bash
python tests/check_install.py
```
## Quick start
### CLI usage
```bash
# Process a single PDF with default settings
tabuparse process example.pdf

# Process multiple PDFs with a configuration file
tabuparse process *.pdf --config settings.toml --output data.csv

# Export to SQLite with summary statistics
tabuparse process documents/*.pdf --format sqlite --summary

# Preview processing without extraction
tabuparse preview *.pdf --config settings.toml

# Extract from a single PDF for testing
tabuparse extract document.pdf --pages "1-3" --flavor stream
```
### Library usage
```python
import asyncio
from tabuparse import process_pdfs
async def main():
    # Process PDFs and get the merged DataFrame
    result_df = await process_pdfs(
        pdf_paths=['invoice1.pdf', 'invoice2.pdf'],
        config_path='schema.toml',
        output_format='csv'
    )

    print(f"Extracted {len(result_df)} rows")
    print(result_df.head())

asyncio.run(main())
```
## Configuration
`tabuparse` uses TOML configuration files to define extraction parameters and expected schemas.
### generate sample configuration
```bash
tabuparse init-config settings.toml --columns "Invoice ID,Date,Amount,Description"
```
### configuration structure
```toml
# settings.toml
[table_structure]
expected_columns = [
    "Invoice ID",
    "Date",
    "Item Description",
    "Quantity",
    "Unit Price",
    "Total Amount"
]

[settings]
output_format = "csv"
strict_schema = false

[default_extraction]
flavor = "lattice"
pages = "all"

# PDF-specific extraction parameters
[[extraction_parameters]]
pdf_path = "invoice_batch_1.pdf"
pages = "1-5"
flavor = "lattice"

[[extraction_parameters]]
pdf_path = "statements.pdf"
pages = "all"
flavor = "stream"
table_areas = ["72,72,432,648"]  # left,bottom,right,top in points
```
### Configuration Options
#### table structure
- `expected_columns`: List of column names for schema normalization
#### settings
- `output_format`: "csv" or "sqlite"
- `strict_schema`: Enable strict schema validation (fail on mismatches)
#### extraction parameters
- `pages`: Page selection ("all", "1", "1,3,5", "1-3")
- `flavor`: Camelot extraction method ("lattice" or "stream")
- `table_areas`: Specific table regions to extract
- `pdf_path`: Apply parameters to specific PDF files
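Because the configuration is plain TOML, it can be sanity-checked before a run with Python's standard-library `tomllib` (Python 3.11+). This sketch is independent of tabuparse and only assumes the structure shown above:
```python
import tomllib

# Load and inspect a tabuparse settings file (structure as documented above).
with open("settings.toml", "rb") as f:
    config = tomllib.load(f)

print(config["table_structure"]["expected_columns"])
print(config.get("settings", {}))

# Per-PDF overrides, if any
for params in config.get("extraction_parameters", []):
    print(params["pdf_path"], params.get("flavor", "lattice"), params.get("pages", "all"))
```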
## CLI Commands
### `process`
Extract and merge tables from multiple PDF files.
```bash
tabuparse process file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH          TOML configuration file
  -o, --output PATH          Output file path
  --format [csv|sqlite]      Output format (default: csv)
  --max-concurrent INT       Max concurrent extractions (default: 5)
  --summary                  Export summary statistics
  --no-clean                 Disable data cleaning
  --strict                   Enable strict schema validation
```
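A run that combines several of these options (file names here are placeholders) might look like:
```bash
# Merge all PDFs under reports/ into a SQLite database, allowing up to
# 8 concurrent extractions and exporting summary statistics as well.
tabuparse process reports/*.pdf \
    --config settings.toml \
    --output reports.sqlite \
    --format sqlite \
    --max-concurrent 8 \
    --summary
```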
### `extract`
Extract tables from a single PDF (for testing).
```bash
tabuparse extract document.pdf [OPTIONS]

Options:
  -c, --config PATH           Configuration file
  --pages TEXT                Pages to extract
  --flavor [lattice|stream]   Extraction method
  --show-info                 Show detailed table information
```
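For example, to inspect what the stream flavor finds on selected pages of a single (placeholder) file:
```bash
# Extract tables from pages 2 and 4 only and print detailed table information.
tabuparse extract statement.pdf --pages "2,4" --flavor stream --show-info
```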
### `preview`
Preview processing statistics without extraction.
```bash
tabuparse preview file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH           Configuration file
```
### `init-config`
Generate sample configuration file.
```bash
tabuparse init-config config.toml [OPTIONS]

Options:
  --columns TEXT              Expected column names (comma-separated)
  --format [csv|sqlite]       Default output format
  --flavor [lattice|stream]   Default extraction flavor
```
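For instance, to start from a SQLite-oriented configuration with the stream flavor as the default:
```bash
# Write a starter configuration; edit the generated TOML afterwards as needed.
tabuparse init-config invoices.toml \
    --columns "Invoice ID,Date,Amount,Description" \
    --format sqlite \
    --flavor stream
```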
### `validate`
Validate PDF file compatibility.
```bash
tabuparse validate document.pdf
```
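To check a whole batch before processing, a plain shell loop works (the loop is ordinary shell, not a tabuparse feature):
```bash
# Validate every PDF in a directory ahead of a large processing run.
for pdf in documents/*.pdf; do
    tabuparse validate "$pdf"
done
```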
## Library API
### core functions
```python
from tabuparse import process_pdfs, extract_from_single_pdf
# Process multiple PDFs
result_df = await process_pdfs(
    pdf_paths=['file1.pdf', 'file2.pdf'],
    config_path='settings.toml',
    output_path='output.csv',
    output_format='csv',
    max_concurrent=5
)

# Extract from a single PDF
tables = await extract_from_single_pdf(
    'document.pdf',
    config_path='settings.toml'
)
```
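The snippets above use bare `await`, so they must run inside a coroutine. A minimal wrapper, assuming the same arguments as above, could look like:
```python
import asyncio

from tabuparse import process_pdfs

async def run_batch():
    # Same call as above, wrapped so it can run from a regular script.
    return await process_pdfs(
        pdf_paths=['file1.pdf', 'file2.pdf'],
        config_path='settings.toml',
        output_path='output.csv',
        output_format='csv',
    )

df = asyncio.run(run_batch())
print(len(df), "rows extracted")
```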
### configuration management
```python
from tabuparse.config_parser import parse_config, TabuparseConfig
# Load configuration
config = parse_config('settings.toml')

# Create a configuration programmatically
config = TabuparseConfig(
    expected_columns=['ID', 'Name', 'Amount'],
    output_format='sqlite'
)
```
### data processing
```python
from tabuparse.data_processor import normalize_schema, merge_dataframes
# Normalize a DataFrame's schema
normalized_df = normalize_schema(
    df,
    expected_columns=['ID', 'Name', 'Amount'],
    strict_mode=False
)

# Merge multiple DataFrames
merged_df = merge_dataframes([df1, df2, df3])
```
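As a toy illustration of how these two helpers fit together (assuming, as suggested above, that non-strict normalization aligns frames to the expected columns):
```python
import pandas as pd

from tabuparse.data_processor import normalize_schema, merge_dataframes

expected = ['ID', 'Name', 'Amount']

# Two frames with slightly different shapes: one extra column, one missing column.
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['a', 'b'], 'Amount': [10.0, 20.5], 'Notes': ['x', 'y']})
df2 = pd.DataFrame({'ID': [3], 'Amount': [30.0]})

aligned = [normalize_schema(df, expected_columns=expected, strict_mode=False) for df in (df1, df2)]
merged = merge_dataframes(aligned)
print(merged)
```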
## Examples
### basic invoice processing
```bash
# Process invoice PDFs with predefined schema
tabuparse process invoices/*.pdf --config invoice_schema.toml --output invoices.csv
```
### multi-format export
```python
import asyncio
from tabuparse import process_pdfs
async def process_financial_data():
    # Extract data
    df = await process_pdfs(
        pdf_paths=['q1_report.pdf', 'q2_report.pdf'],
        config_path='financial_schema.toml'
    )

    # Export to multiple formats (to_excel requires an Excel engine such as openpyxl)
    df.to_csv('financial_data.csv', index=False)
    df.to_excel('financial_data.xlsx', index=False)

    return df

asyncio.run(process_financial_data())
```
### custom processing pipeline
```python
import asyncio

import pandas as pd

from tabuparse.pdf_extractor import extract_tables_from_pdf
from tabuparse.data_processor import normalize_schema
from tabuparse.output_writer import write_sqlite

async def custom_pipeline():
    # Extract tables
    tables = await extract_tables_from_pdf('document.pdf')

    # Normalize each extracted table to the expected schema
    processed_tables = []
    for table in tables:
        normalized = normalize_schema(
            table,
            expected_columns=['ID', 'Date', 'Amount']
        )
        processed_tables.append(normalized)

    # Merge and export to SQLite
    merged = pd.concat(processed_tables, ignore_index=True)
    write_sqlite(merged, 'output.sqlite', table_name='extracted_data')

asyncio.run(custom_pipeline())
```
<br /><hr />
<a href="https://www.flaticon.com/free-icons/samplings" title="samplings icons">Samplings icons by Afian Rochmah Afif - Flaticon</a>