reladiff


* Name: reladiff
* Version: 0.5.3
* Home page: https://github.com/erezsh/reladiff
* Summary: Command-line tool and Python library to efficiently diff rows across two different databases.
* Upload time: 2024-08-20 13:30:01
* Author: Erez Shinan
* Requires Python: <4.0,>=3.8
* License: MIT

![](reladiff_logo.svg)

<span style="font-size:1.3em">**Reladiff**</span> is a high-performance tool and library designed for diffing large datasets across databases. By executing the diff calculation within the database itself, Reladiff minimizes data transfer and achieves optimal performance.

This tool is specifically tailored for data professionals, DevOps engineers, and system administrators.

Reladiff is free, open-source, user-friendly, extensively tested, and delivers fast results, even at massive scale.

### Key Features:

 1. **Cross-Database Diff**: Reladiff employs a divide-and-conquer algorithm, based on matching hashes, to efficiently identify modified segments and download only the necessary data for comparison. This approach ensures exceptional performance when differences are minimal.

    - ⇄  Diffs across more than a dozen different databases (e.g. *PostgreSQL* -> *Snowflake*)!

    - 🧠 Gracefully handles reduced precision (e.g., timestamp(9) -> timestamp(3)) by rounding according to the database specification.

    - 🔥 Benchmarked to diff over 25M rows in under 10 seconds and over 1B rows in approximately 5 minutes, given no differences.

    - ♾️ Capable of handling tables with tens of billions of rows.


2. **Intra-Database Diff**: When both tables reside in the same database, Reladiff compares them using a join operation, with additional optimizations for enhanced speed.

    - Supports materializing the diff into a local table.
    - Can collect various extra statistics about the tables.

3. **Threaded**: Utilizes multiple threads to significantly boost performance during diffing operations.

4. **Configurable**: Offers numerous options for power users to customize and optimize their usage.

5. **Automation-Friendly**: Outputs both JSON and git-like diffs (with + and -), facilitating easy integration into CI/CD pipelines.

6. **Over a dozen databases supported**: MySQL, PostgreSQL, Snowflake, BigQuery, Oracle, ClickHouse, and more. [See full list](https://reladiff.readthedocs.io/en/latest/supported-databases.html)
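
The divide-and-conquer idea behind the cross-database diff can be illustrated with a toy, in-memory sketch. This is not Reladiff's implementation: the "tables" are plain dicts, and `checksum`, `segment`, `hash_diff`, and the `threshold` parameter are hypothetical names chosen for illustration. The sign convention (`-` for rows only in the first table, `+` for rows only in the second) is also an assumption here.

```python
import hashlib
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, str]  # (key, value): a stand-in for a real table row


def segment(table: Dict[int, str], lo: int, hi: int) -> List[Row]:
    """Return the rows whose key falls in [lo, hi), sorted by key."""
    return sorted((k, v) for k, v in table.items() if lo <= k < hi)


def checksum(rows: List[Row]) -> str:
    """Hash a segment into one digest. In a real setting the database
    would compute this, so only the digest crosses the network."""
    h = hashlib.md5()
    for key, value in rows:
        h.update(f"{key}|{value}\n".encode())
    return h.hexdigest()


def hash_diff(
    a: Dict[int, str], b: Dict[int, str], lo: int, hi: int, threshold: int = 4
) -> Iterator[Tuple[str, Row]]:
    """Yield ('-', row) for rows only in `a` and ('+', row) for rows only in `b`."""
    rows_a, rows_b = segment(a, lo, hi), segment(b, lo, hi)
    if checksum(rows_a) == checksum(rows_b):
        return  # identical segment: no rows need to be compared
    if hi - lo <= threshold:
        # Small segment: compare the actual rows
        set_a, set_b = set(rows_a), set(rows_b)
        for row in sorted(set_a - set_b):
            yield ("-", row)
        for row in sorted(set_b - set_a):
            yield ("+", row)
        return
    # Mismatching segment: split the key range and recurse into each half
    mid = (lo + hi) // 2
    yield from hash_diff(a, b, lo, mid, threshold)
    yield from hash_diff(a, b, mid, hi, threshold)


table_a = {i: f"v{i}" for i in range(100)}
table_b = dict(table_a)
table_b[42] = "changed"   # modified row
del table_b[7]            # row missing from the second table
table_b[100] = "new"      # row missing from the first table

diff = list(hash_diff(table_a, table_b, 0, 101))
for sign, row in diff:
    print(sign, row)
```

Because matching segments are discarded after a single checksum comparison, the amount of row data examined shrinks as the tables become more similar, which is why this style of algorithm performs best when differences are few.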


Reladiff is a fork of an archived project called [data-diff](https://github.com/datafold/data-diff).

## Get Started

[**🗎 Read the Documentation**](https://reladiff.readthedocs.io/en/latest/) - our detailed documentation has everything you need to start diffing.

## Quickstart

For the impatient ;)

### Install

Reladiff is available on [PyPI](https://pypi.org/project/reladiff/). You may install it by running:

```
pip install reladiff
```

Requires Python 3.8+ with pip.

We advise installing it within a virtual environment.

### How to Use

Once you've installed Reladiff, you can run it from the command line:

```bash
# Cross-DB diff, using hashes
reladiff  DB1_URI  TABLE1_NAME  DB2_URI  TABLE2_NAME  [OPTIONS]
```

When both tables belong to the same database, a shorter syntax is available:

```bash
# Same-DB diff, using outer join
reladiff  DB1_URI  TABLE1_NAME  TABLE2_NAME  [OPTIONS]
```
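
To give a feel for what a join-based diff computes, here is a minimal sketch using Python's built-in `sqlite3` module. It is an illustration only, not the query Reladiff generates: the table contents, the `QUERY` string, and the `join_diff` helper are made up for this example, and the full outer join is emulated with two `LEFT JOIN`s so that it also runs on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events     (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE old_events (id INTEGER PRIMARY KEY, data TEXT);
    INSERT INTO events     VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO old_events VALUES (1, 'a'), (2, 'B'), (4, 'd');
""")

# First SELECT: rows of `events` that are missing from, or differ in, `old_events`.
# Second SELECT: rows of `old_events` that are missing from `events`.
# `IS NOT` is SQLite's NULL-safe inequality operator.
QUERY = """
    SELECT a.id, a.data, b.id, b.data
    FROM events a LEFT JOIN old_events b ON a.id = b.id
    WHERE b.id IS NULL OR a.data IS NOT b.data
    UNION ALL
    SELECT a.id, a.data, b.id, b.data
    FROM old_events b LEFT JOIN events a ON a.id = b.id
    WHERE a.id IS NULL
"""


def join_diff(connection):
    """Return ('-', row) for rows of `events` and ('+', row) for rows of
    `old_events` that have no identical counterpart in the other table."""
    result = []
    for a_id, a_data, b_id, b_data in connection.execute(QUERY):
        if a_id is not None:
            result.append(("-", (a_id, a_data)))
        if b_id is not None:
            result.append(("+", (b_id, b_data)))
    return result


join_result = join_diff(conn)
for sign, row in join_result:
    print(sign, row)
```

The whole comparison runs inside the database in a single query, so no row hashes or segment bookkeeping are needed when both tables live in the same place.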

Or, you can import and run it from Python:

```python
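from typing import Literal, Tuple

from reladiff import connect_to_table, diff_tables

# Arguments: database URI, table name, key column
table1 = connect_to_table("postgresql:///", "table_name", "id")
table2 = connect_to_table("mysql:///", "table_name", "id")

# Each diff item is a sign ('+' or '-') and the row it applies to
sign: Literal["+", "-"]
row: Tuple[str, ...]
for sign, row in diff_tables(table1, table2):
    print(sign, row)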
```
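
Since the diff is a stream of `(sign, row)` pairs, it is easy to post-process. The sketch below groups such pairs into rows found only in the first table, rows found only in the second, and rows whose key appears on both sides with different values. The `summarize` helper and the sample data are hypothetical, and the code assumes that the key column is the first element of each row and that `-` marks rows from the first table.

```python
from typing import Dict, Iterable, List, Tuple


def summarize(diff: Iterable[Tuple[str, tuple]]) -> Dict[str, List]:
    """Group ('+'/'-', row) pairs by key, assumed to be row[0]."""
    minus: Dict[object, tuple] = {}
    plus: Dict[object, tuple] = {}
    for sign, row in diff:
        (minus if sign == "-" else plus)[row[0]] = row
    return {
        "only_in_table1": sorted(k for k in minus if k not in plus),
        "only_in_table2": sorted(k for k in plus if k not in minus),
        "changed": sorted(k for k in minus if k in plus),
    }


# Hand-written sample standing in for the output of diff_tables(table1, table2)
sample = [
    ("-", ("2", "b")),
    ("+", ("2", "B")),
    ("-", ("3", "c")),
    ("+", ("4", "d")),
]
summary = summarize(sample)
print(summary)
```

With a live connection, the `sample` list would be replaced by the iterator returned from `diff_tables`.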

Read our detailed instructions:

* [How to use from the shell / command-line](https://reladiff.readthedocs.io/en/latest/how-to-use.html#how-to-use-from-the-shell-or-command-line)
    * [How to use with TOML configuration file](https://reladiff.readthedocs.io/en/latest/how-to-use.html#how-to-use-with-a-configuration-file)
* [How to use from Python](https://reladiff.readthedocs.io/en/latest/how-to-use.html#how-to-use-from-python)


#### "Real-world" example: Diff "events" table between Postgres and Snowflake

```
# -k: key column identifying each event
# -c: extra column to compare
# -w: filter applied to the rows on both databases
reladiff \
  postgresql:/// \
  events \
  "snowflake://<username>:<password>@<account>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  events \
  -k event_id \
  -c event_data \
  -w "event_time < '2024-10-10'"
```

#### "Real-world" example: Diff "events" and "old_events" tables in the same Postgres DB

This materializes the results into a new table whose name contains the current timestamp (the `%t` placeholder).

```
reladiff \
  postgresql:///  events  old_events \
  -k org_id \
  -c created_at -c is_internal \
  -w "org_id != 1 and org_id < 2000" \
  -m test_results_%t \
  --materialize-all-rows \
  --table-write-limit 10000
```

### Technical Explanation

Check out this [technical explanation](https://reladiff.readthedocs.io/en/latest/technical-explanation.html) of how Reladiff's cross-database diff works.

### We're here to help!

* Confused? Got a cool idea? Just want to share your thoughts? Let's discuss it in [GitHub Discussions](https://github.com/erezsh/reladiff/discussions).

* Did you encounter a bug? [Open an issue](https://github.com/erezsh/reladiff/issues).

## How to Contribute
* Please read the [contributing guidelines](https://github.com/erezsh/reladiff/blob/master/CONTRIBUTING.md) to get started.
* Feel free to open a new issue or work on an existing one.

Big thanks to everyone who has contributed so far:

<a href="https://github.com/erezsh/reladiff/graphs/contributors">
  <img src="https://contributors-img.web.app/image?repo=erezsh/reladiff" />
</a>


## License

This project is licensed under the terms of the [MIT License](https://github.com/erezsh/reladiff/blob/master/LICENSE).

            
