Name | tabularcompare JSON |
Version |
0.0.4
JSON |
| download |
home_page | |
Summary | Tabular data comparison wrapper for DataComPy |
upload_time | 2023-04-22 06:47:51 |
maintainer | |
docs_url | None |
author | |
requires_python | >=3.8 |
license | Apache-2.0 |
keywords |
tabular
compare
data
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# TabularCompare
[![Python package](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml/badge.svg)](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml) [![PyPI license](https://img.shields.io/pypi/l/tabularcompare)](https://pypi.python.org/pypi/tabularcompare/) [![PyPI pyversions](https://img.shields.io/pypi/pyversions/tabularcompare.svg)](https://img.shields.io/pypi/pyversions/tabularcompare) ![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)
Tabular data comparison wrapper for [DataComPy](https://capitalone.github.io/datacompy/).
## Quick Install
```bash
pip install tabularcompare
```
## Basic Usage
### Comparison object
```python
import pandas as pd
from tabularcompare import Comparison
df1 = pd.DataFrame(
{
"idx1": ["A", "B", "B", "C"],
"idx2": ["01", "01", "02", "03"],
"colA": ["AA", "BA", "BB", "CA"],
"colB": [100, 200, 200, 300]
}
)
df2 = pd.DataFrame(
{
"idx1": ["A", "B", "C"],
"idx2": ["01", "01", "03"],
"colA": ["AA", "XA", "CA"],
"colB": [101, 200, 300],
"colC": ["foo", "bar", "baz"]
}
)
comparison = Comparison(
df1, df2, join_columns=["idx1", "idx2"]
)
```
### Added Functionalities
- #### Diverging Subset
This method introduces an enhanced look at the changes identified at the intersection of compared DataFrames, following the notation ```{df1} --> {df2}```.
```python
comparison.diverging_subset()
```
| | idx1 | idx2 | colA | colB |
|:-:|-----:|-----:|--------------:|----------------:|
| 0 | A | 01 | NaN | {100} --> {101} |
| 1 | B | 01 | {BA} --> {XA} | NaN |
Rows that are unique to either DataFrame can be called via ```.df1_unq_rows()``` and ```.df2_unq_rows()``` methods.
```python
comparison.df1_unq_rows()
```
| | idx1 | idx2 | colA | colB |
|:-:|-----:|-----:|-----:|-----:|
| 2 | B | 02 | BB | 200 |
Columns that are unique to either DataFrame can be called via ```.df1_unq_columns()``` and ```.df2_unq_columns()``` methods.
```python
comparison.df2_unq_columns()
```
| | idx1 | idx2 | colC |
|:-:|-----:|-----:|------:|
| 0 | A | 01 | foo |
| 1 | B | 01 | bar |
| 2 | C | 03 | baz |
- #### Enhanced Reporting
The report functionality is now callable from the comparison object. It includes a .txt, .html, and .xlsx version.
```python
comparison.report_to_txt("./results/Report.txt")
comparison.report_to_html("./results/Report.html")
comparison.report_to_xlsx("./results/Report.xlsx", write_originals=True)
```
The Excel report will output complete comparison results, with tabs dedicated to:
* Original dataframes (when ```write_originals=True```).
* Columns present only on df1 and/or df2.
* Rows present only on df1 and/or df2.
* Diverging subset showing all the changes identified from df1 to df2.
The HTML report will also output a rendered table of the diverging subset on top of the file, alongside the native DataComPy summary report.
- #### Fully-fledged [Command Line Interface](#CLI)
---
### DataComPy Methods
Most methods native to ```datacompy.Compare``` functionality are still present, including;
* ```.report()```
* ```.df1_unq_rows()``` / ```.df2_unq_rows()```
* ```.df1_unq_columns()``` / ```.df2_unq_columns()```
* ```.intersect_columns()```
* ```.intersect_rows()```
The native ```datacompy.Compare``` method is also callable from the ```tabularcompare``` core module:
```python
from tabularcompare import Compare
# datacompy.Compare method
comparison = Compare(
df1, df2, join_columns=["idx1", "idx2"]
)
print(comparison.report())
```
For a complete documentation on DataComPy you can head to [DataComPy](https://capitalone.github.io/datacompy/).
-----
## CLI
Command Line Interface to output an Excel .xlsx report and, optionally, html and txt summaries.
```bash
tabularcompare --help
```
```
Usage: tabularcompare [OPTIONS] DF1 DF2
Options:
-c, --columns, --on TEXT Comma-separated list of key column(s) to compare
df1 against df2. When not provided, df1 and df2
will be matched on index.
-ic, --ignore_columns TEXT Comma-separated list of column(s) to ignore from
df1 and df2.
-n1, --df1_name TEXT Alias for Data frame 1. Default = df1
-n2, --df2_name TEXT Alias for Data frame 2. Default = df2
-is, --ignore_spaces Flag to strip and ignore whitespaces from string
columns.
-ci, --case_insensitive Flag to compare string columns on a case-
insensitive manner.
-cl, --cast_lowercase Flag to cast column names to lower case before
comparison.
-at, --abs_tol FLOAT Absolute tolerance between two numeric values.
-rt, --rel_tol FLOAT Relative tolerance between two numeric values.
-txt, --txt Flag to output a .txt report with a comparison
summary.
-html, --html Flag to output an HTML report with a comparison
summary.
-od, --only_deltas Flag to suppress original dataframes from the
output .xlsx report.
-o, --output, --out PATH Output location for report files. Defaults to
current location.
-e, --encoding TEXT Character encoding to read df1 and df2.
-v, --verbose Verbosity.
--help Show this message and exit.
```
### Sample usage
The application reads from two file paths input for csv, json, or excel files.
```bash
cd ./data/
tabularcompare ./df1.csv ./df2.csv -c 'idx1,idx2' -n1 myTable1 -n2 myTable2 -o ../results/
```
-----
## Caveat
> The comparison results will take into account data types across columns. I.E. If we update our sample dataframe df2 to include a missing value on colB, it will now be of dtype ```object```, as oposed to ```Int64``` in df1. This might lead to miss-leading interpretations of the results to users without information on data types.
```python
import pandas as pd
from tabularcompare import Comparison
df1 = pd.DataFrame(
{
"idx1": ["A", "B", "B", "C"],
"idx2": ["01", "01", "02", "03"],
"colA": ["AA", "BA", "BB", "CA"],
"colB": [100, 200, 200, 300]
}
)
df2 = pd.DataFrame(
{
"idx1": ["A", "B", "C"],
"idx2": ["01", "01", "03"],
"colA": ["AA", "XA", "CA"],
"colB": [pd.NA, 200, 300],
"colC": ["foo", "bar", "baz"]
}
)
comparison = Comparison(
df1, df2, join_columns=["idx1", "idx2"]
)
comparison.diverging_subset()
```
Which will return:
| | idx1 | idx2 | colA | colB |
|:-:|-----:|-----:|--------------:|----------------:|
| 0 | A | 01 | NaN | {100} --> {} |
| 1 | B | 01 | {BA} --> {XA} | {200} --> {200} |
| 3 | C | 03 | NaN | {300} --> {300} |
Raw data
{
"_id": null,
"home_page": "",
"name": "tabularcompare",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "",
"keywords": "Tabular,Compare,Data",
"author": "",
"author_email": "Erich Henrique <erich.hs13@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/cc/4a/c8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a/tabularcompare-0.0.4.tar.gz",
"platform": null,
"description": "# TabularCompare\n\n[![Python package](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml/badge.svg)](https://github.com/erich-hs/tabularcompare/actions/workflows/python-package.yml) [![PyPI license](https://img.shields.io/pypi/l/tabularcompare)](https://pypi.python.org/pypi/tabularcompare/) [![PyPI pyversions](https://img.shields.io/pypi/pyversions/tabularcompare.svg)](https://img.shields.io/pypi/pyversions/tabularcompare) ![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)\n\nTabular data comparison wrapper for [DataComPy](https://capitalone.github.io/datacompy/).\n\n## Quick Install\n```bash\npip install tabularcompare\n```\n\n## Basic Usage\n\n### Comparison object\n```python\nimport pandas as pd\nfrom tabularcompare import Comparison\n\ndf1 = pd.DataFrame(\n {\n \"idx1\": [\"A\", \"B\", \"B\", \"C\"],\n \"idx2\": [\"01\", \"01\", \"02\", \"03\"],\n \"colA\": [\"AA\", \"BA\", \"BB\", \"CA\"],\n \"colB\": [100, 200, 200, 300]\n }\n)\ndf2 = pd.DataFrame(\n {\n \"idx1\": [\"A\", \"B\", \"C\"],\n \"idx2\": [\"01\", \"01\", \"03\"],\n \"colA\": [\"AA\", \"XA\", \"CA\"],\n \"colB\": [101, 200, 300],\n \"colC\": [\"foo\", \"bar\", \"baz\"]\n }\n)\n\ncomparison = Comparison(\n df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\n```\n### Added Functionalities\n- #### Diverging Subset\nThis method introduces an enhanced look at the changes identified at the intersection of compared DataFrames, following the notation ```{df1} --> {df2}```.\n```python\ncomparison.diverging_subset()\n```\n| | idx1 | idx2 | colA | colB |\n|:-:|-----:|-----:|--------------:|----------------:|\n| 0 | A | 01 | NaN | {100} --> {101} |\n| 1 | B | 01 | {BA} --> {XA} | NaN |\n\nRows that are unique to either DataFrame can be called via ```.df1_unq_rows()``` and ```.df2_unq_rows()``` methods.\n\n```python\ncomparison.df1_unq_rows()\n```\n| | idx1 | idx2 |\tcolA | colB |\n|:-:|-----:|-----:|-----:|-----:|\n| 2 |\tB | 02 | BB |\t200 |\n\nColumns that are unique to either DataFrame can be called via ```.df1_unq_columns()``` and ```.df2_unq_columns()``` methods.\n\n```python\ncomparison.df2_unq_columns()\n```\n| | idx1 | idx2 |\t colC |\n|:-:|-----:|-----:|------:|\n| 0 |\tA | 01 | foo |\n| 1 |\tB | 01 | bar |\n| 2 |\tC | 03 | baz |\n\n- #### Enhanced Reporting\nThe report functionality is now callable from the comparison object. It includes a .txt, .html, and .xlsx version.\n```python\ncomparison.report_to_txt(\"./results/Report.txt\")\ncomparison.report_to_html(\"./results/Report.html\")\ncomparison.report_to_xlsx(\"./results/Report.xlsx\", write_originals=True)\n```\nThe Excel report will output complete comparison results, with tabs dedicated to:\n* Original dataframes (when ```write_originals=True```).\n* Columns present only on df1 and/or df2.\n* Rows present only on df1 and/or df2.\n* Diverging subset showing all the changes identified from df1 to df2.\n\nThe HTML report will also output a rendered table of the diverging subset on top of the file, alongside the native DataComPy summary report.\n\n- #### Fully-fledged [Command Line Interface](#CLI)\n---\n### DataComPy Methods\nMost methods native to ```datacompy.Compare``` functionality are still present, including;\n* ```.report()```\n* ```.df1_unq_rows()``` / ```.df2_unq_rows()```\n* ```.df1_unq_columns()``` / ```.df2_unq_columns()```\n* ```.intersect_columns()```\n* ```.intersect_rows()```\n\nThe native ```datacompy.Compare``` method is also callable from the ```tabularcompare``` core module:\n```python\nfrom tabularcompare import Compare\n\n# datacompy.Compare method\ncomparison = Compare(\n df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\nprint(comparison.report())\n```\nFor a complete documentation on DataComPy you can head to [DataComPy](https://capitalone.github.io/datacompy/).\n\n-----\n## CLI\nCommand Line Interface to output an Excel .xlsx report and, optionally, html and txt summaries.\n```bash\ntabularcompare --help\n```\n\n```\nUsage: tabularcompare [OPTIONS] DF1 DF2\n\nOptions:\n -c, --columns, --on TEXT Comma-separated list of key column(s) to compare\n df1 against df2. When not provided, df1 and df2\n will be matched on index.\n -ic, --ignore_columns TEXT Comma-separated list of column(s) to ignore from\n df1 and df2.\n -n1, --df1_name TEXT Alias for Data frame 1. Default = df1\n -n2, --df2_name TEXT Alias for Data frame 2. Default = df2\n -is, --ignore_spaces Flag to strip and ignore whitespaces from string\n columns.\n -ci, --case_insensitive Flag to compare string columns on a case-\n insensitive manner.\n -cl, --cast_lowercase Flag to cast column names to lower case before\n comparison.\n -at, --abs_tol FLOAT Absolute tolerance between two numeric values.\n -rt, --rel_tol FLOAT Relative tolerance between two numeric values.\n -txt, --txt Flag to output a .txt report with a comparison\n summary.\n -html, --html Flag to output an HTML report with a comparison\n summary.\n -od, --only_deltas Flag to suppress original dataframes from the\n output .xlsx report.\n -o, --output, --out PATH Output location for report files. Defaults to\n current location.\n -e, --encoding TEXT Character encoding to read df1 and df2.\n -v, --verbose Verbosity.\n --help Show this message and exit.\n```\n\n### Sample usage\nThe application reads from two file paths input for csv, json, or excel files.\n```bash\ncd ./data/\ntabularcompare ./df1.csv ./df2.csv -c 'idx1,idx2' -n1 myTable1 -n2 myTable2 -o ../results/\n```\n\n-----\n## Caveat\n> The comparison results will take into account data types across columns. I.E. If we update our sample dataframe df2 to include a missing value on colB, it will now be of dtype ```object```, as oposed to ```Int64``` in df1. This might lead to miss-leading interpretations of the results to users without information on data types.\n\n```python\nimport pandas as pd\nfrom tabularcompare import Comparison\n\ndf1 = pd.DataFrame(\n {\n \"idx1\": [\"A\", \"B\", \"B\", \"C\"],\n \"idx2\": [\"01\", \"01\", \"02\", \"03\"],\n \"colA\": [\"AA\", \"BA\", \"BB\", \"CA\"],\n \"colB\": [100, 200, 200, 300]\n }\n)\ndf2 = pd.DataFrame(\n {\n \"idx1\": [\"A\", \"B\", \"C\"],\n \"idx2\": [\"01\", \"01\", \"03\"],\n \"colA\": [\"AA\", \"XA\", \"CA\"],\n \"colB\": [pd.NA, 200, 300],\n \"colC\": [\"foo\", \"bar\", \"baz\"]\n }\n)\n\ncomparison = Comparison(\n df1, df2, join_columns=[\"idx1\", \"idx2\"]\n)\ncomparison.diverging_subset()\n```\nWhich will return:\n| | idx1 | idx2 | colA | colB |\n|:-:|-----:|-----:|--------------:|----------------:|\n| 0 | A | 01 | NaN | {100} --> {} |\n| 1 | B | 01 | {BA} --> {XA} | {200} --> {200} |\n| 3 | C | 03 | NaN | {300} --> {300} |\n",
"bugtrack_url": null,
"license": "Apache-2.0",
"summary": "Tabular data comparison wrapper for DataComPy",
"version": "0.0.4",
"split_keywords": [
"tabular",
"compare",
"data"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "c111cee5fdddaca4405fc04ec18c6a0c24975ee49ab1985cf4eefd9e1dab456d",
"md5": "b79e7c984ccc06c3f8dbbecb4a8d6ed3",
"sha256": "a870403a30b004b8ba4a9cded52d13795798ea2d9c9828b6a46d57022854bb83"
},
"downloads": -1,
"filename": "tabularcompare-0.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b79e7c984ccc06c3f8dbbecb4a8d6ed3",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 14805,
"upload_time": "2023-04-22T06:47:49",
"upload_time_iso_8601": "2023-04-22T06:47:49.846262Z",
"url": "https://files.pythonhosted.org/packages/c1/11/cee5fdddaca4405fc04ec18c6a0c24975ee49ab1985cf4eefd9e1dab456d/tabularcompare-0.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "cc4ac8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a",
"md5": "4e849861fc8c2eb4bb5a5755958913b3",
"sha256": "b4041418eead523bfe73de75481632332dfda77725b202fec471c01dd5d0db2d"
},
"downloads": -1,
"filename": "tabularcompare-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "4e849861fc8c2eb4bb5a5755958913b3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 19726,
"upload_time": "2023-04-22T06:47:51",
"upload_time_iso_8601": "2023-04-22T06:47:51.652511Z",
"url": "https://files.pythonhosted.org/packages/cc/4a/c8570559c5876a122814cad5b1abac315d0d73bc364475cc3953e48d256a/tabularcompare-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-04-22 06:47:51",
"github": false,
"gitlab": false,
"bitbucket": false,
"lcname": "tabularcompare"
}