| Name | tablediff-arrow |
| Version | 0.1.0 |
| home_page | None |
| Summary | Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports—built on Apache Arrow. |
| upload_time | 2025-10-13 05:51:53 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.10 |
| license | MIT |
| keywords | diff, parquet, csv, arrow, data-comparison |
| VCS | https://github.com/psmman/tablediff-arrow |
| bugtrack_url | None |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# tablediff-arrow
Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports—built on Apache Arrow.
## Features
- **Fast**: Built on Apache Arrow for high-performance data processing
- **Multiple Formats**: Support for Parquet, CSV, and Arrow IPC files
- **S3 Support**: Read files directly from S3 (optional)
- **Keyed Comparisons**: Compare tables using one or more key columns
- **Numeric Tolerances**: Configure absolute and relative tolerances for numeric columns
- **Rich Reports**: Generate HTML and CSV reports with detailed differences
- **Python 3.10+**: Modern Python with type hints and clean APIs
- **Well Tested**: Comprehensive test suite with high coverage
## Installation
```bash
pip install tablediff-arrow
```
For S3 support:
```bash
pip install "tablediff-arrow[s3]"
```
For development:
```bash
pip install -e ".[dev]"
```
## Quick Start
### Command Line Interface
Compare two Parquet files using `id` as the key column:
```bash
tablediff left.parquet right.parquet -k id
```
Compare with an absolute tolerance of 0.01 on the `amount` column:
```bash
tablediff left.csv right.csv -k id -t amount:0.01
```
Generate an HTML report:
```bash
tablediff left.parquet right.parquet -k id -o report.html
```
Compare files stored in S3 (requires the optional `s3` extra):
```bash
tablediff s3://bucket/left.parquet s3://bucket/right.parquet -k id --s3
```
### Python API
```python
from tablediff_arrow import TableDiff
from tablediff_arrow.reports import generate_csv_report, generate_html_report

# Create a differ with key columns and tolerances
differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01},           # Absolute tolerance
    relative_tolerance={'price': 0.001},  # Relative tolerance (0.1%)
)

# Compare files
result = differ.compare_files('left.parquet', 'right.parquet')

# Print summary
print(result.summary())

# Check if there are differences
if result.has_differences:
    print(f"Found {result.changed_rows} changed rows")
    print(f"Found {result.left_only_rows} rows only in left")
    print(f"Found {result.right_only_rows} rows only in right")

# Generate reports
generate_html_report(result, 'report.html')
generate_csv_report(result, 'output_dir/', prefix='diff')
```
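When a comparison runs inside a script or CI job, it can help to turn the result into an exit code. The sketch below relies only on the attributes shown above (`has_differences`, `changed_rows`, `left_only_rows`, `right_only_rows`); the helper names `exit_code` and `describe` are hypothetical, and a `SimpleNamespace` stands in for a real result object so the snippet runs without any files.

```python
from types import SimpleNamespace


def exit_code(result) -> int:
    """Return 1 when the comparison found differences, otherwise 0."""
    return 1 if result.has_differences else 0


def describe(result) -> str:
    """Build a one-line summary from the row counters."""
    return (
        f"{result.changed_rows} changed, "
        f"{result.left_only_rows} left-only, "
        f"{result.right_only_rows} right-only"
    )


# Stand-in for the object returned by differ.compare_files(...).
fake_result = SimpleNamespace(
    has_differences=True, changed_rows=1, left_only_rows=0, right_only_rows=2
)

print(describe(fake_result))
code = exit_code(fake_result)
```

In a real script, `fake_result` would be replaced by the comparison result and `code` passed to `sys.exit`.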
## Usage Examples
### Multiple Key Columns
Compare tables using composite keys:
```bash
tablediff left.parquet right.parquet -k year -k month -k product
```
```python
from tablediff_arrow import TableDiff

differ = TableDiff(key_columns=['year', 'month', 'product'])
result = differ.compare_files('left.parquet', 'right.parquet')
```
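To make the idea of a composite key concrete, here is a minimal pure-Python sketch of a keyed comparison. It is not the library's implementation (which is built on Apache Arrow); the function `keyed_diff` and the sample rows are hypothetical and only illustrate how rows are matched on a tuple of key values and then classified.

```python
from typing import Any

Row = dict[str, Any]


def keyed_diff(left: list[Row], right: list[Row], key_columns: list[str]) -> dict[str, list]:
    """Classify keys as left-only, right-only, changed, or matched."""

    def index(rows: list[Row]) -> dict[tuple, Row]:
        # The composite key is the tuple of values in the key columns.
        # Duplicate keys are not handled here: the last row wins.
        return {tuple(row[c] for c in key_columns): row for row in rows}

    left_idx, right_idx = index(left), index(right)
    result: dict[str, list] = {"left_only": [], "right_only": [], "changed": [], "matched": []}
    for key, row in left_idx.items():
        if key not in right_idx:
            result["left_only"].append(key)
        elif row != right_idx[key]:
            result["changed"].append(key)
        else:
            result["matched"].append(key)
    result["right_only"] = [key for key in right_idx if key not in left_idx]
    return result


left = [
    {"year": 2024, "month": 1, "product": "a", "units": 10},
    {"year": 2024, "month": 2, "product": "a", "units": 12},
]
right = [
    {"year": 2024, "month": 1, "product": "a", "units": 11},
    {"year": 2024, "month": 3, "product": "a", "units": 9},
]
diff = keyed_diff(left, right, ["year", "month", "product"])
print(diff)
```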
### Numeric Tolerances
Use absolute tolerance for monetary values:
```bash
tablediff left.csv right.csv -k id -t amount:0.01 -t balance:0.001
```
Use relative tolerance for percentages:
```bash
tablediff left.csv right.csv -k id -r rate:0.001 -r score:0.01
```
```python
from tablediff_arrow import TableDiff

differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01, 'balance': 0.001},
    relative_tolerance={'rate': 0.001, 'score': 0.01},
)
```
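The exact formulas the library applies are not spelled out here, so the following sketch shows one common definition of the two tolerance kinds, as an assumption: an absolute tolerance bounds `|left - right|` directly, while a relative tolerance bounds it by a fraction of the larger magnitude. The helper `within_tolerance` is hypothetical.

```python
def within_tolerance(
    left: float, right: float, abs_tol: float = 0.0, rel_tol: float = 0.0
) -> bool:
    """Return True if the two values are equal within either tolerance."""
    diff = abs(left - right)
    # Absolute tolerance: the raw difference must not exceed abs_tol.
    if diff <= abs_tol:
        return True
    # Relative tolerance: the difference is scaled by the larger magnitude
    # (assumed convention, the same one math.isclose uses).
    return diff <= rel_tol * max(abs(left), abs(right))


print(within_tolerance(100.00, 100.005, abs_tol=0.01))
print(within_tolerance(100.00, 100.02, abs_tol=0.01))
print(within_tolerance(1000.0, 1000.5, rel_tol=0.001))
print(within_tolerance(1000.0, 1002.0, rel_tol=0.001))
```

An absolute tolerance suits columns with a fixed scale such as currency amounts, whereas a relative tolerance scales with the value being compared.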
### Working with PyArrow Tables
```python
import pyarrow as pa
from tablediff_arrow import TableDiff
# Create tables directly
left = pa.table({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.table({'id': [1, 2, 3], 'value': [10, 21, 30]})
# Compare
differ = TableDiff(key_columns=['id'])
result = differ.compare_tables(left, right)
print(result.summary())
```
### S3 Files
```python
import s3fs
from tablediff_arrow import TableDiff
# Create S3 filesystem
fs = s3fs.S3FileSystem()
# Compare S3 files
differ = TableDiff(key_columns=['id'])
result = differ.compare_files(
    's3://my-bucket/left.parquet',
    's3://my-bucket/right.parquet',
    filesystem=fs,
)
```
## CLI Options
```
Usage: tablediff [OPTIONS] LEFT RIGHT

  Compare two tables and generate diff reports.

Arguments:
  LEFT   Path to the left/source table file (local or s3://)
  RIGHT  Path to the right/target table file (local or s3://)

Options:
  -k, --key TEXT              Key column(s) for comparison (required, can be
                              specified multiple times)
  -t, --tolerance TEXT        Absolute tolerance for numeric columns
                              (format: column:value)
  -r, --relative-tolerance    Relative tolerance for numeric columns
                              (format: column:value)
  --left-format [parquet|csv|arrow]
                              Format of the left file
  --right-format [parquet|csv|arrow]
                              Format of the right file
  -o, --output TEXT           Output file path for HTML report
  --csv-output PATH           Output directory for CSV reports
  --s3                        Enable S3 filesystem support
  --help                      Show this message and exit.
```
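The `-t` and `-r` options take values in `column:value` form and may be repeated. The sketch below shows one way such values could be turned into the dictionaries accepted by `TableDiff`; `parse_tolerance_specs` is a hypothetical helper, and the CLI's own parsing may differ.

```python
def parse_tolerance_specs(specs: list[str]) -> dict[str, float]:
    """Parse ['amount:0.01', ...] into {'amount': 0.01, ...}."""
    tolerances: dict[str, float] = {}
    for spec in specs:
        # Split on the last colon so the numeric part is always the tail.
        column, sep, value = spec.rpartition(":")
        if not sep or not column:
            raise ValueError(f"expected column:value, got {spec!r}")
        tolerances[column] = float(value)  # raises ValueError if not numeric
    return tolerances


parsed = parse_tolerance_specs(["amount:0.01", "balance:0.001"])
print(parsed)
```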
## Output Reports
### HTML Report
The HTML report provides an interactive view of differences:
- Summary statistics (matched, changed, added, removed rows)
- Color-coded differences table
- Separate sections for left-only and right-only rows
- Change counts per column
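As an illustration of the per-column change counts, the sketch below tallies changes from a list of change records. The record shape (`key`, `column`, `old`, `new`) is an assumption made for this example and is not taken from the library's internals.

```python
from collections import Counter

# Hypothetical change records: one entry per differing cell.
changes = [
    {"key": (1,), "column": "amount", "old": 10.0, "new": 10.5},
    {"key": (2,), "column": "amount", "old": 20.0, "new": 21.0},
    {"key": (2,), "column": "status", "old": "open", "new": "closed"},
]


def change_counts(records: list[dict]) -> Counter:
    """Count how many changed cells fall in each column."""
    return Counter(record["column"] for record in records)


counts = change_counts(changes)
for column, n in counts.most_common():
    print(f"{column}: {n}")
```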
### CSV Reports
CSV output generates multiple files:
- `{prefix}_summary.csv`: Summary statistics
- `{prefix}_changes.csv`: Detailed changes with old and new values
- `{prefix}_left_only.csv`: Rows only in the left table
- `{prefix}_right_only.csv`: Rows only in the right table
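Given the naming scheme above, the set of files to expect for a given output directory and prefix can be listed ahead of time, for example to check that a run produced them. `expected_csv_reports` is a hypothetical helper written for this sketch.

```python
from pathlib import Path


def expected_csv_reports(output_dir: str, prefix: str = "diff") -> list[Path]:
    """List the CSV report paths implied by the {prefix}_*.csv naming scheme."""
    suffixes = ["summary", "changes", "left_only", "right_only"]
    return [Path(output_dir) / f"{prefix}_{suffix}.csv" for suffix in suffixes]


report_paths = expected_csv_reports("output_dir", prefix="diff")
for path in report_paths:
    print(path.name)
```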
## Development
### Setup
```bash
# Clone the repository
git clone https://github.com/psmman/tablediff-arrow.git
cd tablediff-arrow
# Install with development dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
```
### Running Tests
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=tablediff_arrow --cov-report=html
# Run specific test file
pytest tests/test_compare.py
```
### Code Quality
```bash
# Format code
black src tests
# Lint
ruff check src tests
# Type check
mypy src
```
### Pre-commit Hooks
The project uses pre-commit hooks to ensure code quality:
- trailing-whitespace: Remove trailing whitespace
- end-of-file-fixer: Ensure files end with a newline
- check-yaml/json/toml: Validate config files
- black: Format Python code
- ruff: Lint Python code
- mypy: Type checking
## Requirements
- Python 3.10 or higher
- pyarrow >= 14.0.0
- pandas >= 2.0.0
- click >= 8.0.0
- jinja2 >= 3.0.0
- s3fs >= 2023.0.0 (optional, for S3 support)
## License
MIT License. See the [LICENSE](LICENSE) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Raw data
{
"_id": null,
"home_page": null,
"name": "tablediff-arrow",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": "diff, parquet, csv, arrow, data-comparison",
"author": null,
"author_email": "Prasenjit Singh <psmman@users.noreply.github.com>",
"download_url": "https://files.pythonhosted.org/packages/e8/27/c5c40e7cf36b3893155b40e6fd3274944d40a8dfc0696cac74df54599919/tablediff_arrow-0.1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports\u2014built on Apache Arrow.",
"version": "0.1.0",
"project_urls": {
"Homepage": "https://github.com/psmman/tablediff-arrow",
"Issues": "https://github.com/psmman/tablediff-arrow/issues",
"Repository": "https://github.com/psmman/tablediff-arrow"
},
"split_keywords": [
"diff",
" parquet",
" csv",
" arrow",
" data-comparison"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "495b7ac64916f844840ef32f42e0905f5ba600f74b2e0b759dd16252800bede3",
"md5": "153f1725fe26cef3479f2c8de1d4e0cc",
"sha256": "f3abb723c8e8058d8288c43048c86f14b2b8b289631e4f12d879096c53b9bf61"
},
"downloads": -1,
"filename": "tablediff_arrow-0.1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "153f1725fe26cef3479f2c8de1d4e0cc",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.10",
"size": 12335,
"upload_time": "2025-10-13T05:51:51",
"upload_time_iso_8601": "2025-10-13T05:51:51.756045Z",
"url": "https://files.pythonhosted.org/packages/49/5b/7ac64916f844840ef32f42e0905f5ba600f74b2e0b759dd16252800bede3/tablediff_arrow-0.1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e827c5c40e7cf36b3893155b40e6fd3274944d40a8dfc0696cac74df54599919",
"md5": "a19ce70975204c8bd34239f2300de4b9",
"sha256": "c23fb28970c27f095d8193710e0825d69ad5ad1120ab5c3189a3d51ed95d82c1"
},
"downloads": -1,
"filename": "tablediff_arrow-0.1.0.tar.gz",
"has_sig": false,
"md5_digest": "a19ce70975204c8bd34239f2300de4b9",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 16390,
"upload_time": "2025-10-13T05:51:53",
"upload_time_iso_8601": "2025-10-13T05:51:53.080725Z",
"url": "https://files.pythonhosted.org/packages/e8/27/c5c40e7cf36b3893155b40e6fd3274944d40a8dfc0696cac74df54599919/tablediff_arrow-0.1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-13 05:51:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "psmman",
"github_project": "tablediff-arrow",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "tablediff-arrow"
}