# tablediff-arrow

- Version: 0.1.0
- Uploaded: 2025-10-13 05:51:53
- Author: Prasenjit Singh
- Repository: https://github.com/psmman/tablediff-arrow
- Requires Python: >=3.10
- License: MIT
- Keywords: diff, parquet, csv, arrow, data-comparison

Fast, file-based diffs for Parquet/CSV/Arrow (local or S3) with keyed comparisons, per-column tolerances, and HTML/CSV reports—built on Apache Arrow.

[![CI](https://github.com/psmman/tablediff-arrow/workflows/CI/badge.svg)](https://github.com/psmman/tablediff-arrow/actions)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- **Fast**: Built on Apache Arrow for high-performance data processing
- **Multiple Formats**: Support for Parquet, CSV, and Arrow IPC files
- **S3 Support**: Read files directly from S3 (optional)
- **Keyed Comparisons**: Compare tables using one or more key columns
- **Numeric Tolerances**: Configure absolute and relative tolerances for numeric columns
- **Rich Reports**: Generate HTML and CSV reports with detailed differences
- **Python 3.10+**: Modern Python with type hints and clean APIs
- **Well Tested**: Comprehensive test suite with high coverage

## Installation

```bash
pip install tablediff-arrow
```

For S3 support:

```bash
pip install tablediff-arrow[s3]
```

For development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### Command Line Interface

Compare two Parquet files using `id` as the key column:

```bash
tablediff left.parquet right.parquet -k id
```

Compare with numeric tolerance:

```bash
tablediff left.csv right.csv -k id -t amount:0.01
```

Generate an HTML report:

```bash
tablediff left.parquet right.parquet -k id -o report.html
```

Compare S3 files:

```bash
tablediff s3://bucket/left.parquet s3://bucket/right.parquet -k id --s3
```

### Python API

```python
from tablediff_arrow import TableDiff

# Create a differ with key columns and tolerances
differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01},  # Absolute tolerance
    relative_tolerance={'price': 0.001}  # Relative tolerance (0.1%)
)

# Compare files
result = differ.compare_files('left.parquet', 'right.parquet')

# Print summary
print(result.summary())

# Check if there are differences
if result.has_differences:
    print(f"Found {result.changed_rows} changed rows")
    print(f"Found {result.left_only_rows} rows only in left")
    print(f"Found {result.right_only_rows} rows only in right")

# Generate reports
from tablediff_arrow.reports import generate_html_report, generate_csv_report

generate_html_report(result, 'report.html')
generate_csv_report(result, 'output_dir/', prefix='diff')
```
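In an automated pipeline, the result attributes used above can drive a pass/fail decision. The following is a minimal sketch rather than part of tablediff-arrow: `diff_exit_code` and `max_changed` are hypothetical names, and the code assumes that `changed_rows`, `left_only_rows`, and `right_only_rows` are integer counts, as the f-strings above suggest.

```python
from types import SimpleNamespace


def diff_exit_code(result, max_changed: int = 0) -> int:
    """Return 0 if the diff is acceptable, 1 otherwise.

    `result` is any object exposing has_differences, changed_rows,
    left_only_rows and right_only_rows (assumed to be integer counts).
    """
    if not result.has_differences:
        return 0
    # Rows present on only one side always fail the check.
    if result.left_only_rows or result.right_only_rows:
        return 1
    # Changed rows fail only when they exceed the allowed threshold.
    return 1 if result.changed_rows > max_changed else 0


# Stand-in object for illustration; a real run would pass the value
# returned by differ.compare_files(...).
fake = SimpleNamespace(
    has_differences=True, changed_rows=2, left_only_rows=0, right_only_rows=0
)
print(diff_exit_code(fake))                 # 1
print(diff_exit_code(fake, max_changed=5))  # 0
```

A script could pass the returned value to `sys.exit()` so that a CI job fails when the two tables diverge.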

## Usage Examples

### Multiple Key Columns

Compare tables using composite keys:

```bash
tablediff left.parquet right.parquet -k year -k month -k product
```

```python
from tablediff_arrow import TableDiff

differ = TableDiff(key_columns=['year', 'month', 'product'])
result = differ.compare_files('left.parquet', 'right.parquet')
```
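To make the idea of a composite key concrete, here is a conceptual sketch in plain Python. It is not the library's Arrow-based implementation; `keyed_diff` is a hypothetical helper, and it assumes each key combination appears at most once per table.

```python
def keyed_diff(left_rows, right_rows, key_columns):
    """Classify rows by composite key into left-only, right-only, and changed."""

    def index(rows):
        # Map each composite key tuple to its row (assumes keys are unique).
        return {tuple(row[k] for k in key_columns): row for row in rows}

    left, right = index(left_rows), index(right_rows)
    left_only = sorted(left.keys() - right.keys())
    right_only = sorted(right.keys() - left.keys())
    # A key present on both sides counts as changed if any value differs.
    changed = sorted(k for k in left.keys() & right.keys() if left[k] != right[k])
    return left_only, right_only, changed


left_rows = [
    {"year": 2024, "month": 1, "product": "a", "amount": 10},
    {"year": 2024, "month": 2, "product": "a", "amount": 20},
]
right_rows = [
    {"year": 2024, "month": 1, "product": "a", "amount": 11},
    {"year": 2024, "month": 3, "product": "a", "amount": 30},
]

left_only, right_only, changed = keyed_diff(
    left_rows, right_rows, ["year", "month", "product"]
)
print(left_only)   # [(2024, 2, 'a')]
print(right_only)  # [(2024, 3, 'a')]
print(changed)     # [(2024, 1, 'a')]
```

Rows are matched only when every key column agrees, which is why the February and March rows land in separate one-sided buckets instead of being reported as a change.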

### Numeric Tolerances

Use absolute tolerance for monetary values:

```bash
tablediff left.csv right.csv -k id -t amount:0.01 -t balance:0.001
```

Use relative tolerance for percentages:

```bash
tablediff left.csv right.csv -k id -r rate:0.001 -r score:0.01
```

```python
from tablediff_arrow import TableDiff

differ = TableDiff(
    key_columns=['id'],
    tolerance={'amount': 0.01, 'balance': 0.001},
    relative_tolerance={'rate': 0.001, 'score': 0.01}
)
```
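The difference between the two tolerance types can be illustrated with the standard library. This sketch uses `math.isclose`, which treats two values as equal when `abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)`. Whether tablediff-arrow applies exactly this formula is not stated here, so treat it as an assumption; `values_match` is a hypothetical helper.

```python
import math


def values_match(left: float, right: float,
                 abs_tol: float = 0.0, rel_tol: float = 0.0) -> bool:
    """Return True if the two values are equal within either tolerance."""
    return math.isclose(left, right, rel_tol=rel_tol, abs_tol=abs_tol)


# Absolute tolerance: the allowed gap is fixed regardless of magnitude.
print(values_match(100.00, 100.005, abs_tol=0.01))  # True
print(values_match(100.00, 100.02, abs_tol=0.01))   # False

# Relative tolerance: the allowed gap scales with the values (0.1% here).
print(values_match(200.0, 200.1, rel_tol=0.001))    # True
print(values_match(200.0, 200.5, rel_tol=0.001))    # False
```

Absolute tolerances suit columns with a known unit of precision, such as cents, while relative tolerances suit columns whose magnitude varies widely.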

### Working with PyArrow Tables

```python
import pyarrow as pa
from tablediff_arrow import TableDiff

# Create tables directly
left = pa.table({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.table({'id': [1, 2, 3], 'value': [10, 21, 30]})

# Compare
differ = TableDiff(key_columns=['id'])
result = differ.compare_tables(left, right)

print(result.summary())
```

### S3 Files

```python
import s3fs
from tablediff_arrow import TableDiff

# Create S3 filesystem
fs = s3fs.S3FileSystem()

# Compare S3 files
differ = TableDiff(key_columns=['id'])
result = differ.compare_files(
    's3://my-bucket/left.parquet',
    's3://my-bucket/right.parquet',
    filesystem=fs
)
```

## CLI Options

```
Usage: tablediff [OPTIONS] LEFT RIGHT

  Compare two tables and generate diff reports.

Arguments:
  LEFT   Path to the left/source table file (local or s3://)
  RIGHT  Path to the right/target table file (local or s3://)

Options:
  -k, --key TEXT              Key column(s) for comparison (required, can be
                              specified multiple times)
  -t, --tolerance TEXT        Absolute tolerance for numeric columns
                              (format: column:value)
  -r, --relative-tolerance    Relative tolerance for numeric columns
                              (format: column:value)
  --left-format [parquet|csv|arrow]
                              Format of the left file
  --right-format [parquet|csv|arrow]
                              Format of the right file
  -o, --output TEXT           Output file path for HTML report
  --csv-output PATH           Output directory for CSV reports
  --s3                        Enable S3 filesystem support
  --help                      Show this message and exit.
```
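The `column:value` format accepted by `-t` and `-r` maps naturally onto the dictionaries taken by the Python API. The sketch below shows one way such strings could be parsed; `parse_tolerances` is a hypothetical helper for illustration and is not taken from the tablediff-arrow source.

```python
def parse_tolerances(specs):
    """Turn ['amount:0.01', 'balance:0.001'] into a {column: tolerance} dict."""
    tolerances = {}
    for spec in specs:
        # Split on the last colon so column names containing ':' still work.
        column, sep, value = spec.rpartition(":")
        if not sep or not column:
            raise ValueError(f"expected column:value, got {spec!r}")
        # float() raises ValueError itself if the value is not numeric.
        tolerances[column] = float(value)
    return tolerances


print(parse_tolerances(["amount:0.01", "balance:0.001"]))
# {'amount': 0.01, 'balance': 0.001}
```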

## Output Reports

### HTML Report

The HTML report provides an interactive view of differences:

- Summary statistics (matched, changed, added, removed rows)
- Color-coded differences table
- Separate sections for left-only and right-only rows
- Change counts per column

### CSV Reports

CSV output generates multiple files:

- `{prefix}_summary.csv`: Summary statistics
- `{prefix}_changes.csv`: Detailed changes with old and new values
- `{prefix}_left_only.csv`: Rows only in the left table
- `{prefix}_right_only.csv`: Rows only in the right table
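When post-processing the CSV output, it can help to compute the expected file paths up front. This is a small sketch based only on the naming pattern listed above; `expected_csv_reports` is a hypothetical helper, not a tablediff-arrow function.

```python
from pathlib import Path


def expected_csv_reports(output_dir: str, prefix: str = "diff") -> dict[str, Path]:
    """Map each report kind to the path implied by the naming pattern above."""
    base = Path(output_dir)
    kinds = ("summary", "changes", "left_only", "right_only")
    return {kind: base / f"{prefix}_{kind}.csv" for kind in kinds}


paths = expected_csv_reports("output_dir", prefix="diff")
print(paths["changes"].name)  # diff_changes.csv
```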

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/psmman/tablediff-arrow.git
cd tablediff-arrow

# Install with development dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=tablediff_arrow --cov-report=html

# Run specific test file
pytest tests/test_compare.py
```

### Code Quality

```bash
# Format code
black src tests

# Lint
ruff check src tests

# Type check
mypy src
```

### Pre-commit Hooks

The project uses pre-commit hooks to ensure code quality:

- trailing-whitespace: Remove trailing whitespace
- end-of-file-fixer: Ensure files end with a newline
- check-yaml/json/toml: Validate config files
- black: Format Python code
- ruff: Lint Python code
- mypy: Type checking

## Requirements

- Python 3.10 or higher
- pyarrow >= 14.0.0
- pandas >= 2.0.0
- click >= 8.0.0
- jinja2 >= 3.0.0
- s3fs >= 2023.0.0 (optional, for S3 support)

## License

MIT License. See the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


            
