tabularcompare


Nametabularcompare JSON
Version 0.0.4 PyPI version JSON
download
home_page
SummaryTabular data comparison wrapper for DataComPy
upload_time2023-04-22 06:47:51
maintainer
docs_urlNone
author
requires_python>=3.8
licenseApache-2.0
keywords tabular compare data
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # TabularCompare

[![Python package](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml/badge.svg)](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml) [![PyPI license](https://img.shields.io/pypi/l/tabularcompare)](https://pypi.python.org/pypi/tabularcompare/) [![PyPI pyversions](https://img.shields.io/pypi/pyversions/tabularcompare.svg)](https://img.shields.io/pypi/pyversions/tabularcompare) ![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)

Tabular data comparison wrapper for [DataComPy](https://capitalone.github.io/datacompy/).

## Quick Install
```bash
pip install tabularcompare
```

## Basic Usage

### Comparison object
```python
import pandas as pd
from tabularcompare import Comparison

df1 = pd.DataFrame(
    {
        "idx1": ["A", "B", "B", "C"],
        "idx2": ["01", "01", "02", "03"],
        "colA": ["AA", "BA", "BB", "CA"],
        "colB": [100, 200, 200, 300]
     }
)
df2 = pd.DataFrame(
    {
        "idx1": ["A", "B", "C"],
        "idx2": ["01", "01", "03"],
        "colA": ["AA", "XA", "CA"],
        "colB": [101, 200, 300],
        "colC": ["foo", "bar", "baz"]
     }
)

comparison = Comparison(
    df1, df2, join_columns=["idx1", "idx2"]
)
```
### Added Functionalities
- #### Diverging Subset
This method introduces an enhanced look at the changes identified at the intersection of compared DataFrames, following the notation ```{df1} --> {df2}```.
```python
comparison.diverging_subset()
```
|   | idx1 | idx2 |          colA |            colB |
|:-:|-----:|-----:|--------------:|----------------:|
| 0 | A    | 01   | NaN           | {100} --> {101} |
| 1 | B    | 01   | {BA} --> {XA} | NaN             |

Rows that are unique to either DataFrame can be called via ```.df1_unq_rows()``` and ```.df2_unq_rows()``` methods.

```python
comparison.df1_unq_rows()
```
|   | idx1 | idx2 |	colA | colB |
|:-:|-----:|-----:|-----:|-----:|
| 2 |	B  |  02  |  BB  |	200 |

Columns that are unique to either DataFrame can be called via ```.df1_unq_columns()``` and ```.df2_unq_columns()``` methods.

```python
comparison.df2_unq_columns()
```
|   | idx1 | idx2 |	 colC |
|:-:|-----:|-----:|------:|
| 0 |	A  |  01  |  foo  |
| 1 |	B  |  01  |  bar  |
| 2 |	C  |  03  |  baz  |

- #### Enhanced Reporting
The report functionality is now callable from the comparison object. It includes a .txt, .html, and .xlsx version.
```python
comparison.report_to_txt("./results/Report.txt")
comparison.report_to_html("./results/Report.html")
comparison.report_to_xlsx("./results/Report.xlsx", write_originals=True)
```
The Excel report will output complete comparison results, with tabs dedicated to:
* Original dataframes (when ```write_originals=True```).
* Columns present only on df1 and/or df2.
* Rows present only on df1 and/or df2.
* Diverging subset showing all the changes identified from df1 to df2.

The HTML report will also output a rendered table of the diverging subset on top of the file, alongside the native DataComPy summary report.

- #### Fully-fledged [Command Line Interface](#CLI)
---
### DataComPy Methods
Most methods native to ```datacompy.Compare``` functionality are still present, including;
* ```.report()```
* ```.df1_unq_rows()``` / ```.df2_unq_rows()```
* ```.df1_unq_columns()``` / ```.df2_unq_columns()```
* ```.intersect_columns()```
* ```.intersect_rows()```

The native ```datacompy.Compare``` method is also callable from the ```tabularcompare``` core module:
```python
from tabularcompare import Compare

# datacompy.Compare method
comparison = Compare(
    df1, df2, join_columns=["idx1", "idx2"]
)
print(comparison.report())
```
For a complete documentation on DataComPy you can head to [DataComPy](https://capitalone.github.io/datacompy/).

-----
## CLI
Command Line Interface to output an Excel .xlsx report and, optionally, html and txt summaries.
```bash
tabularcompare --help
```

```
Usage: tabularcompare [OPTIONS] DF1 DF2

Options:
  -c, --columns, --on TEXT    Comma-separated list of key column(s) to compare
                              df1 against df2. When not provided, df1 and df2
                              will be matched on index.
  -ic, --ignore_columns TEXT  Comma-separated list of column(s) to ignore from
                              df1 and df2.
  -n1, --df1_name TEXT        Alias for Data frame 1. Default = df1
  -n2, --df2_name TEXT        Alias for Data frame 2. Default = df2
  -is, --ignore_spaces        Flag to strip and ignore whitespaces from string
                              columns.
  -ci, --case_insensitive     Flag to compare string columns on a case-
                              insensitive manner.
  -cl, --cast_lowercase       Flag to cast column names to lower case before
                              comparison.
  -at, --abs_tol FLOAT        Absolute tolerance between two numeric values.
  -rt, --rel_tol FLOAT        Relative tolerance between two numeric values.
  -txt, --txt                 Flag to output a .txt report with a comparison
                              summary.
  -html, --html               Flag to output an HTML report with a comparison
                              summary.
  -od, --only_deltas          Flag to suppress original dataframes from the
                              output .xlsx report.
  -o, --output, --out PATH    Output location for report files. Defaults to
                              current location.
  -e, --encoding TEXT         Character encoding to read df1 and df2.
  -v, --verbose               Verbosity.
  --help                      Show this message and exit.
```

### Sample usage
The application reads from two file paths input for csv, json, or excel files.
```bash
cd ./data/
tabularcompare ./df1.csv ./df2.csv -c 'idx1,idx2' -n1 myTable1 -n2 myTable2 -o ../results/
```

-----
## Caveat
> The comparison results will take into account data types across columns. I.E. If we update our sample dataframe df2 to include a missing value on colB, it will now be of dtype ```object```, as oposed to ```Int64``` in df1. This might lead to miss-leading interpretations of the results to users without information on data types.

```python
import pandas as pd
from tabularcompare import Comparison

df1 = pd.DataFrame(
    {
        "idx1": ["A", "B", "B", "C"],
        "idx2": ["01", "01", "02", "03"],
        "colA": ["AA", "BA", "BB", "CA"],
        "colB": [100, 200, 200, 300]
     }
)
df2 = pd.DataFrame(
    {
        "idx1": ["A", "B", "C"],
        "idx2": ["01", "01", "03"],
        "colA": ["AA", "XA", "CA"],
        "colB": [pd.NA, 200, 300],
        "colC": ["foo", "bar", "baz"]
     }
)

comparison = Comparison(
    df1, df2, join_columns=["idx1", "idx2"]
)
comparison.diverging_subset()
```
Which will return:
|   | idx1 | idx2 |          colA |            colB |
|:-:|-----:|-----:|--------------:|----------------:|
| 0 | A    | 01   | NaN           | {100} --> {}    |
| 1 | B    | 01   | {BA} --> {XA} | {200} --> {200} |
| 3 | C    | 03   | NaN           | {300} --> {300} |

            

Raw data

            {
    "_id": null,
    "home_page": "",
    "name": "tabularcompare",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "Tabular,Compare,Data",
    "author": "",
    "author_email": "Erich Henrique <erich.hs13@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/cc/4a/c8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a/tabularcompare-0.0.4.tar.gz",
    "platform": null,
    "description": "# TabularCompare\n\n[![Python package](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml/badge.svg)](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml) [![PyPI license](https://img.shields.io/pypi/l/tabularcompare)](https://pypi.python.org/pypi/tabularcompare/) [![PyPI pyversions](https://img.shields.io/pypi/pyversions/tabularcompare.svg)](https://img.shields.io/pypi/pyversions/tabularcompare) ![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)\n\nTabular data comparison wrapper for [DataComPy](https://capitalone.github.io/datacompy/).\n\n## Quick Install\n```bash\npip install tabularcompare\n```\n\n## Basic Usage\n\n### Comparison object\n```python\nimport pandas as pd\nfrom tabularcompare import Comparison\n\ndf1 = pd.DataFrame(\n    {\n        \"idx1\": [\"A\", \"B\", \"B\", \"C\"],\n        \"idx2\": [\"01\", \"01\", \"02\", \"03\"],\n        \"colA\": [\"AA\", \"BA\", \"BB\", \"CA\"],\n        \"colB\": [100, 200, 200, 300]\n     }\n)\ndf2 = pd.DataFrame(\n    {\n        \"idx1\": [\"A\", \"B\", \"C\"],\n        \"idx2\": [\"01\", \"01\", \"03\"],\n        \"colA\": [\"AA\", \"XA\", \"CA\"],\n        \"colB\": [101, 200, 300],\n        \"colC\": [\"foo\", \"bar\", \"baz\"]\n     }\n)\n\ncomparison = Comparison(\n    df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\n```\n### Added Functionalities\n- #### Diverging Subset\nThis method introduces an enhanced look at the changes identified at the intersection of compared DataFrames, following the notation ```{df1} --> {df2}```.\n```python\ncomparison.diverging_subset()\n```\n|   | idx1 | idx2 |          colA |            colB |\n|:-:|-----:|-----:|--------------:|----------------:|\n| 0 | A    | 01   | NaN           | {100} --> {101} |\n| 1 | B    | 01   | {BA} --> {XA} | NaN             |\n\nRows that are unique to either DataFrame can be called via ```.df1_unq_rows()``` and ```.df2_unq_rows()``` methods.\n\n```python\ncomparison.df1_unq_rows()\n```\n|   | idx1 | idx2 |\tcolA | colB |\n|:-:|-----:|-----:|-----:|-----:|\n| 2 |\tB  |  02  |  BB  |\t200 |\n\nColumns that are unique to either DataFrame can be called via ```.df1_unq_columns()``` and ```.df2_unq_columns()``` methods.\n\n```python\ncomparison.df2_unq_columns()\n```\n|   | idx1 | idx2 |\t colC |\n|:-:|-----:|-----:|------:|\n| 0 |\tA  |  01  |  foo  |\n| 1 |\tB  |  01  |  bar  |\n| 2 |\tC  |  03  |  baz  |\n\n- #### Enhanced Reporting\nThe report functionality is now callable from the comparison object. It includes a .txt, .html, and .xlsx version.\n```python\ncomparison.report_to_txt(\"./results/Report.txt\")\ncomparison.report_to_html(\"./results/Report.html\")\ncomparison.report_to_xlsx(\"./results/Report.xlsx\", write_originals=True)\n```\nThe Excel report will output complete comparison results, with tabs dedicated to:\n* Original dataframes (when ```write_originals=True```).\n* Columns present only on df1 and/or df2.\n* Rows present only on df1 and/or df2.\n* Diverging subset showing all the changes identified from df1 to df2.\n\nThe HTML report will also output a rendered table of the diverging subset on top of the file, alongside the native DataComPy summary report.\n\n- #### Fully-fledged [Command Line Interface](#CLI)\n---\n### DataComPy Methods\nMost methods native to ```datacompy.Compare``` functionality are still present, including;\n* ```.report()```\n* ```.df1_unq_rows()``` / ```.df2_unq_rows()```\n* ```.df1_unq_columns()``` / ```.df2_unq_columns()```\n* ```.intersect_columns()```\n* ```.intersect_rows()```\n\nThe native ```datacompy.Compare``` method is also callable from the ```tabularcompare``` core module:\n```python\nfrom tabularcompare import Compare\n\n# datacompy.Compare method\ncomparison = Compare(\n    df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\nprint(comparison.report())\n```\nFor a complete documentation on DataComPy you can head to [DataComPy](https://capitalone.github.io/datacompy/).\n\n-----\n## CLI\nCommand Line Interface to output an Excel .xlsx report and, optionally, html and txt summaries.\n```bash\ntabularcompare --help\n```\n\n```\nUsage: tabularcompare [OPTIONS] DF1 DF2\n\nOptions:\n  -c, --columns, --on TEXT    Comma-separated list of key column(s) to compare\n                              df1 against df2. When not provided, df1 and df2\n                              will be matched on index.\n  -ic, --ignore_columns TEXT  Comma-separated list of column(s) to ignore from\n                              df1 and df2.\n  -n1, --df1_name TEXT        Alias for Data frame 1. Default = df1\n  -n2, --df2_name TEXT        Alias for Data frame 2. Default = df2\n  -is, --ignore_spaces        Flag to strip and ignore whitespaces from string\n                              columns.\n  -ci, --case_insensitive     Flag to compare string columns on a case-\n                              insensitive manner.\n  -cl, --cast_lowercase       Flag to cast column names to lower case before\n                              comparison.\n  -at, --abs_tol FLOAT        Absolute tolerance between two numeric values.\n  -rt, --rel_tol FLOAT        Relative tolerance between two numeric values.\n  -txt, --txt                 Flag to output a .txt report with a comparison\n                              summary.\n  -html, --html               Flag to output an HTML report with a comparison\n                              summary.\n  -od, --only_deltas          Flag to suppress original dataframes from the\n                              output .xlsx report.\n  -o, --output, --out PATH    Output location for report files. Defaults to\n                              current location.\n  -e, --encoding TEXT         Character encoding to read df1 and df2.\n  -v, --verbose               Verbosity.\n  --help                      Show this message and exit.\n```\n\n### Sample usage\nThe application reads from two file paths input for csv, json, or excel files.\n```bash\ncd ./data/\ntabularcompare ./df1.csv ./df2.csv -c 'idx1,idx2' -n1 myTable1 -n2 myTable2 -o ../results/\n```\n\n-----\n## Caveat\n> The comparison results will take into account data types across columns. I.E. If we update our sample dataframe df2 to include a missing value on colB, it will now be of dtype ```object```, as oposed to ```Int64``` in df1. This might lead to miss-leading interpretations of the results to users without information on data types.\n\n```python\nimport pandas as pd\nfrom tabularcompare import Comparison\n\ndf1 = pd.DataFrame(\n    {\n        \"idx1\": [\"A\", \"B\", \"B\", \"C\"],\n        \"idx2\": [\"01\", \"01\", \"02\", \"03\"],\n        \"colA\": [\"AA\", \"BA\", \"BB\", \"CA\"],\n        \"colB\": [100, 200, 200, 300]\n     }\n)\ndf2 = pd.DataFrame(\n    {\n        \"idx1\": [\"A\", \"B\", \"C\"],\n        \"idx2\": [\"01\", \"01\", \"03\"],\n        \"colA\": [\"AA\", \"XA\", \"CA\"],\n        \"colB\": [pd.NA, 200, 300],\n        \"colC\": [\"foo\", \"bar\", \"baz\"]\n     }\n)\n\ncomparison = Comparison(\n    df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\ncomparison.diverging_subset()\n```\nWhich will return:\n|   | idx1 | idx2 |          colA |            colB |\n|:-:|-----:|-----:|--------------:|----------------:|\n| 0 | A    | 01   | NaN           | {100} --> {}    |\n| 1 | B    | 01   | {BA} --> {XA} | {200} --> {200} |\n| 3 | C    | 03   | NaN           | {300} --> {300} |\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "Tabular data comparison wrapper for DataComPy",
    "version": "0.0.4",
    "split_keywords": [
        "tabular",
        "compare",
        "data"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c111cee5fdddaca4405fc04ec18c6a0c24975ee49ab1985cf4eefd9e1dab456d",
                "md5": "b79e7c984ccc06c3f8dbbecb4a8d6ed3",
                "sha256": "a870403a30b004b8ba4a9cded52d13795798ea2d9c9828b6a46d57022854bb83"
            },
            "downloads": -1,
            "filename": "tabularcompare-0.0.4-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b79e7c984ccc06c3f8dbbecb4a8d6ed3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 14805,
            "upload_time": "2023-04-22T06:47:49",
            "upload_time_iso_8601": "2023-04-22T06:47:49.846262Z",
            "url": "https://files.pythonhosted.org/packages/c1/11/cee5fdddaca4405fc04ec18c6a0c24975ee49ab1985cf4eefd9e1dab456d/tabularcompare-0.0.4-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cc4ac8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a",
                "md5": "4e849861fc8c2eb4bb5a5755958913b3",
                "sha256": "b4041418eead523bfe73de75481632332dfda77725b202fec471c01dd5d0db2d"
            },
            "downloads": -1,
            "filename": "tabularcompare-0.0.4.tar.gz",
            "has_sig": false,
            "md5_digest": "4e849861fc8c2eb4bb5a5755958913b3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 19726,
            "upload_time": "2023-04-22T06:47:51",
            "upload_time_iso_8601": "2023-04-22T06:47:51.652511Z",
            "url": "https://files.pythonhosted.org/packages/cc/4a/c8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a/tabularcompare-0.0.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-04-22 06:47:51",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "lcname": "tabularcompare"
}
        
Elapsed time: 0.06307s