pl-compare 0.6.0

- Summary: A tool to find the differences between two tables.
- Upload time: 2024-10-27 15:32:05
- Requires Python: <4.0,>=3.8

# pl_compare: Compare and find the differences between two Polars DataFrames.

[GitHub](https://github.com/concur1/pl_compare) - [PyPI page](https://pypi.org/project/pl-compare/)

**You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to**:
- Understand how well two tables reconcile [example](#Full-report)
- Find the schema differences between two tables [example](#Schema-differences-summary-and-details)
- Find counts or examples of rows that exist in one table but not another [example](#Row-differences-summary-and-details)
- Find counts or examples of value differences between two tables [example](#Value-differences-summary-and-details)
- Assert that two tables are exactly equal (such as for an automated test) [example](#Assert-two-frames-are-equal-for-a-test)
- Assert that two tables have matching schemas, rows or column values [example](#Return-booleans-to-check-for-schema-row-and-value-differences)

[Click for a Jupyter notebook with example usage](https://github.com/concur1/pl_compare/blob/main/pl_compare_demo.ipynb)

![](demo1.gif)

**With pl-compare you can**:
- Get statistical summaries and/or examples and/or a boolean to indicate:
  - Schema differences
  - Row differences
  - Value differences
- Work with pandas DataFrames and other tabular data formats by converting them via Apache Arrow
- View differences as a text report
- Get differences as a Polars LazyFrame or DataFrame
- Use LazyFrames for larger-than-memory comparisons
- Specify the equality calculation that is used to determine value differences


## Installation

```zsh
pip install pl_compare
```

## Examples

### Return booleans to check for schema, row and value differences 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("is_schemas_equal:", compare_result.is_schemas_equal())
is_schemas_equal: False
>>> print("is_rows_equal:", compare_result.is_rows_equal())
is_rows_equal: False
>>> print("is_values_equal:", compare_result.is_values_equal())
is_values_equal: False
>>>
```


### Schema differences summary and details 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("schemas_summary()")
schemas_summary()
>>> print(compare_result.schemas_summary())
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema difference... ┆ 1     │
└─────────────────────────────────┴───────┘
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
>>>
```
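
To make the summary counts above easier to read, here is a minimal conceptual sketch in plain Python. It is not pl_compare's implementation: the two dictionaries are hypothetical stand-ins for the schemas of `base_df` and `compare_df`, and the variable names are illustrative only.

```python
# Conceptual sketch only: plain dicts stand in for the two DataFrame schemas.
base_schema = {"ID": "String", "Example1": "Int64", "Example2": "String"}
compare_schema = {
    "ID": "String",
    "Example1": "Int64",
    "Example2": "Int64",
    "Example3": "Int64",
}

base_cols = set(base_schema)
compare_cols = set(compare_schema)
shared = base_cols & compare_cols

# One entry per statistic shown in the schemas_summary() table.
schema_summary = {
    "Columns in base": len(base_cols),
    "Columns in compare": len(compare_cols),
    "Columns in base and compare": len(shared),
    "Columns only in base": len(base_cols - compare_cols),
    "Columns only in compare": len(compare_cols - base_cols),
    "Columns with schema differences": sum(
        base_schema[col] != compare_schema[col] for col in shared
    ),
}

# Columns whose type differs, or that exist on one side only (None = missing).
schema_sample = {
    col: (base_schema.get(col), compare_schema.get(col))
    for col in sorted(base_cols | compare_cols)
    if base_schema.get(col) != compare_schema.get(col)
}

print(schema_summary)
print(schema_sample)
```

The two printed dictionaries correspond to the six summary rows and the two sample rows (`Example2` and `Example3`) in the output above.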


### Row differences summary and details 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("rows_summary()")
rows_summary()
>>> print(compare_result.rows_summary())
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
>>> print("rows_sample()")
rows_sample()
>>> print(compare_result.rows_sample())
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
>>>
```
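
The row statistics above can be thought of as set operations on the join column. The following plain-Python sketch illustrates that idea under the assumption that the join key is unique on each side; it is not how pl_compare computes the result internally, and the variable names are hypothetical.

```python
# Conceptual sketch only: the "ID" values from the two example frames.
base_ids = ["123456", "1234567", "12345678"]
compare_ids = ["123456", "1234567", "1234567810"]

base_set, compare_set = set(base_ids), set(compare_ids)

# One entry per statistic shown in the rows_summary() table.
rows_summary = {
    "Rows in base": len(base_ids),
    "Rows in compare": len(compare_ids),
    "Rows only in base": len(base_set - compare_set),
    "Rows only in compare": len(compare_set - base_set),
    "Rows in base and compare": len(base_set & compare_set),
}

# Keys that appear on one side only, labelled like the rows_sample() output.
rows_sample = [(key, "in base only") for key in sorted(base_set - compare_set)]
rows_sample += [(key, "in compare only") for key in sorted(compare_set - base_set)]

print(rows_summary)
print(rows_sample)
```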


### Value differences summary and details 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("values_summary()")
values_summary()
>>> print(compare_result.values_summary())
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
>>>
```
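
The `Percentage` column is not defined in this README. The sketch below shows one reading that is consistent with the output above: differences are counted only on rows present in both tables, and only for columns that are compared (here just `Example1`, since `Example2` has a type difference and `Example3` exists only in compare). Both the formula and the variable names are assumptions for illustration, not pl_compare's code.

```python
# Conceptual sketch only: Example1 values keyed by ID for each table.
base = {"123456": 1, "1234567": 6, "12345678": 3}
compare = {"123456": 1, "1234567": 2, "1234567810": 3}
value_columns = ["Example1"]  # assumed set of compared columns

# Only rows whose ID exists in both tables can have value differences.
common_ids = sorted(base.keys() & compare.keys())

# Number of differing values per compared column.
diff_counts = {"Example1": sum(base[i] != compare[i] for i in common_ids)}
total_diffs = sum(diff_counts.values())

# Assumed formula: differences divided by the number of compared cells.
compared_cells = len(common_ids) * len(value_columns)
total_pct = 100 * total_diffs / compared_cells
column_pct = {col: 100 * n / len(common_ids) for col, n in diff_counts.items()}

print(total_diffs, total_pct)   # 1 50.0
print(diff_counts, column_pct)  # {'Example1': 1} {'Example1': 50.0}
```

Under the same assumption, one difference across three joined rows and two compared columns gives 100 * 1 / 6 ≈ 16.666667 in total and 100 * 1 / 3 ≈ 33.333333 for the affected column.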


### Full report 

```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> compare_result.report()
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
<BLANKLINE>
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema difference... ┆ 1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```


### Compare two pandas dataframes 


```python
>>> import polars as pl
>>> import pandas as pd # doctest: +SKIP
>>> from pl_compare import compare
>>>
>>> base_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )  # doctest: +SKIP
>>> compare_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )  # doctest: +SKIP
>>>
>>> compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))  # doctest: +SKIP
>>> compare_result.report()  # doctest: +SKIP
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------

SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------

ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```



### Specify a threshold to control the granularity of the comparison for numeric columns. 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1.111, 6.11, 3.11],
...     }
... )
>>>
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1.114, 6.14, 3.12],
...     },
... )
>>>
>>> print("With equality_resolution of 0.01")
With equality_resolution of 0.01
>>> compare_result = compare(["ID"], base_df, compare_df, resolution=0.01)
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ f64  ┆ f64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6.11 ┆ 6.14    │
└─────────┴──────────┴──────┴─────────┘
>>> print("With no equality_resolution")
With no equality_resolution
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print(compare_result.values_sample())
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
│ ID      ┆ variable ┆ base  ┆ compare │
│ ---     ┆ ---      ┆ ---   ┆ ---     │
│ str     ┆ str      ┆ f64   ┆ f64     │
╞═════════╪══════════╪═══════╪═════════╡
│ 123456  ┆ Example1 ┆ 1.111 ┆ 1.114   │
│ 1234567 ┆ Example1 ┆ 6.11  ┆ 6.14    │
└─────────┴──────────┴───────┴─────────┘
>>>
```
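
One way to read the `resolution` argument is as an absolute tolerance: two numeric values count as different only when they differ by more than the resolution. The plain-Python sketch below reproduces the two outputs above under that assumption. The helper `value_differences` is hypothetical, and whether the library uses a strict or non-strict comparison at the boundary is not stated here.

```python
# Conceptual sketch only: Example1 values keyed by ID for each table.
base = {"123456": 1.111, "1234567": 6.11, "12345678": 3.11}
compare = {"123456": 1.114, "1234567": 6.14, "1234567810": 3.12}


def value_differences(base, compare, resolution=None):
    """Return (id, base_value, compare_value) for joined rows that differ."""
    diffs = []
    # Only IDs present in both tables are compared.
    for key in sorted(base.keys() & compare.keys()):
        b, c = base[key], compare[key]
        if resolution is None:
            differs = b != c  # exact comparison
        else:
            differs = abs(b - c) > resolution  # assumed tolerance rule
        if differs:
            diffs.append((key, b, c))
    return diffs


print(value_differences(base, compare, resolution=0.01))
# [('1234567', 6.11, 6.14)]
print(value_differences(base, compare))
# [('123456', 1.111, 1.114), ('1234567', 6.11, 6.14)]
```

The pair 3.11 and 3.12 never appears because those values belong to different IDs, so the rows are not joined.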



### Example using alias for base and compare dataframes. 


```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"],
...                          base_df,
...                          compare_df,
...                          base_alias="before_change",
...                          compare_alias="after_change")
>>>
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
│ column   ┆ before_change_format ┆ after_change_format │
│ ---      ┆ ---                  ┆ ---                 │
│ str      ┆ str                  ┆ str                 │
╞══════════╪══════════════════════╪═════════════════════╡
│ Example2 ┆ String               ┆ Int64               │
│ Example3 ┆ null                 ┆ Int64               │
└──────────┴──────────────────────┴─────────────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
│ ID      ┆ variable ┆ before_change ┆ after_change │
│ ---     ┆ ---      ┆ ---           ┆ ---          │
│ str     ┆ str      ┆ i64           ┆ i64          │
╞═════════╪══════════╪═══════════════╪══════════════╡
│ 1234567 ┆ Example1 ┆ 6             ┆ 2            │
└─────────┴──────────┴───────────────┴──────────────┘
>>>
```


### Assert two frames are equal for a test 


```python
>>> import polars as pl
>>> import pytest
>>> from pl_compare.compare import compare
>>>
>>> def test_example():
...     base_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 3],
...             "Example2": [1, 2, 3],
...         }
...     )
...     compare_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 9],
...             "Example2": [1, 2, 3],
...         }
...     )
...     comparison = compare(["ID"], base_df, compare_df)
...     if not comparison.is_equal():
...         raise Exception(comparison.report())
...
>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in test_example
Exception: --------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
No Schema differences found.
--------------------------------------------------------------------------------
No Row differences found (when joining by the supplied id_columns).
--------------------------------------------------------------------------------

VALUE DIFFERENCES:
shape: (3, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 16.666667  │
│ Example1                ┆ 1     ┆ 33.333333  │
│ Example2                ┆ 0     ┆ 0.0        │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌──────────┬──────────┬──────┬─────────┐
│ ID       ┆ variable ┆ base ┆ compare │
│ ---      ┆ ---      ┆ ---  ┆ ---     │
│ str      ┆ str      ┆ i64  ┆ i64     │
╞══════════╪══════════╪══════╪═════════╡
│ 12345678 ┆ Example1 ┆ 3    ┆ 9       │
└──────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```



### To do:
- [x] Linting (Ruff)
- [x] Make into python package
- [x] Add makefile for easy linting and tests
- [x] Statistics should indicate which statistics are referencing columns
- [x] Add all statistics frame to tests
- [x] Add schema differences to schema summary
- [x] Make row examples alternate between base only and compare only so that it is more readable.
- [x] Add limit value to the examples.
- [x] Updated value differences summary so that Statistic is something that makes sense.
- [x] Publish package to pypi
- [x] Add difference criterion.
- [x] Add license
- [x] Make package easy to use (i.e. so you only have to import pl_compare and then you can use pl_compare)
- [x] Add table name labels that can replace 'base' and 'compare'.
- [x] Update code to use a config dataclass that can be passed between the class and functions.
- [x] Write up docstrings
- [x] Write up readme (with code examples)
- [x] Add parameter to hide column differences with 0 differences.
- [x] Add flag to indicate if there are differences between the tables.
- [x] Update report so that non differences are not displayed.
- [x] Separate out dev dependencies from library dependencies?
- [x] Change 'threshold' to be equality resolution.
- [x] strict MyPy type checking
- [x] Raise error and print examples if duplicates are present.
- [x] Add total number of value differences to the value differences summary.
- [x] Add percentage column to the value differences summary.
- [x] Change id_columns to be named 'join_columns' 
- [x] Github actions for publishing
- [x] Update the duplication validation.
- [x] Fix report output when tables are exactly equal.
- [x] Github actions for testing
- [x] Github actions for linting
- [x] Add message when there are no columns left to be compared.
- [x] Add message when df's are exactly equal. 
- [x] Add test case with exactly equal dfs.
- [x] Add test case with no columns being compared. Make sure an error is raised when the values_summary and values_sample methods are called.
- [ ] Test for large amounts of data
- [ ] Benchmark for different sizes of data.
- [ ] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?
- [ ] There still seems to be a bug when converting from lazy to data frame using streaming (i.e. in the convert_to_dataframe function)

## Ideas:
- [ ] Simplify custom equality checks and add example.
- [ ] Add a count of the number of rows that have any differences to the value differences summary.
- [ ] Add a test that checks that arbitrary join conditions work.



            

\u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6    \u2506 2       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Full report \n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...         \"Example1\": [1, 6, 3],\n...         \"Example2\": [\"1\", \"2\", \"3\"],\n...     }\n... )\n>>> compare_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n...         \"Example1\": [1, 2, 3],\n...         \"Example2\": [1, 2, 3],\n...         \"Example3\": [1, 2, 3],\n...     },\n... 
)\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> compare_result.report()\n--------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\n<BLANKLINE>\nSCHEMA DIFFERENCES:\nshape: (6, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic                       \u2506 Count \u2502\n\u2502 ---                             \u2506 ---   \u2502\n\u2502 str                             \u2506 i64   \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Columns in base                 \u2506 3     \u2502\n\u2502 Columns in compare              \u2506 4     \u2502\n\u2502 Columns in base and compare     \u2506 3     \u2502\n\u2502 Columns only in base            \u2506 0     \u2502\n\u2502 Columns only in compare         \u2506 1     \u2502\n\u2502 Columns with schema difference... 
\u2506 1     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column   \u2506 base_format \u2506 compare_format \u2502\n\u2502 ---      \u2506 ---         \u2506 ---            \u2502\n\u2502 str      \u2506 str         \u2506 str            \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String      \u2506 Int64          \u2502\n\u2502 Example3 \u2506 null        \u2506 Int64          \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n<BLANKLINE>\nROW DIFFERENCES:\nshape: (5, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic                \u2506 Count \u2502\n\u2502 ---                      \u2506 ---   \u2502\n\u2502 str                      \u2506 i64   
\u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Rows in base             \u2506 3     \u2502\n\u2502 Rows in compare          \u2506 3     \u2502\n\u2502 Rows only in base        \u2506 1     \u2502\n\u2502 Rows only in compare     \u2506 1     \u2502\n\u2502 Rows in base and compare \u2506 2     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID         \u2506 variable \u2506 value           \u2502\n\u2502 ---        \u2506 ---      \u2506 ---             \u2502\n\u2502 str        \u2506 str      \u2506 str             \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678   \u2506 status   \u2506 in base only    \u2502\n\u2502 1234567810 \u2506 status   \u2506 in compare only \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n<BLANKLINE>\nVALUE DIFFERENCES:\nshape: (2, 
3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences       \u2506 Count \u2506 Percentage \u2502\n\u2502 ---                     \u2506 ---   \u2506 ---        \u2502\n\u2502 str                     \u2506 i64   \u2506 f64        \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1     \u2506 50.0       \u2502\n\u2502 Example1                \u2506 1     \u2506 50.0       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID      \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 ---     \u2506 ---      \u2506 ---  \u2506 ---     \u2502\n\u2502 str     \u2506 str      \u2506 i64  \u2506 i64     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6    \u2506 2       
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n### Compare two pandas dataframes \n\n\n```python\n>>> import polars as pl\n>>> import pandas as pd # doctest: +SKIP\n>>> from pl_compare import compare\n>>>\n>>> base_df = pd.DataFrame(data=\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...         \"Example1\": [1, 6, 3],\n...         \"Example2\": [\"1\", \"2\", \"3\"],\n...     }\n... )# doctest: +SKIP\n>>> compare_df = pd.DataFrame(data=\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n...         \"Example1\": [1, 2, 3],\n...         \"Example2\": [1, 2, 3],\n...         \"Example3\": [1, 2, 3],\n...     },\n... 
)# doctest: +SKIP\n>>>\n>>> compare_result = compare([\"ID\"], pl.from_pandas(base_df), pl.from_pandas(compare_df))# doctest: +SKIP\n>>> compare_result.report()# doctest: +SKIP\n--------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\n\nSCHEMA DIFFERENCES:\nshape: (6, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic                       \u2506 Count \u2502\n\u2502 ---                             \u2506 ---   \u2502\n\u2502 str                             \u2506 i64   \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Columns in base                 \u2506 3     \u2502\n\u2502 Columns in compare              \u2506 4     \u2502\n\u2502 Columns in base and compare     \u2506 3     \u2502\n\u2502 Columns only in base            \u2506 0     \u2502\n\u2502 Columns only in compare         \u2506 1     \u2502\n\u2502 Columns with schema differences \u2506 1     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 
3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column   \u2506 base_format \u2506 compare_format \u2502\n\u2502 ---      \u2506 ---         \u2506 ---            \u2502\n\u2502 str      \u2506 str         \u2506 str            \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String      \u2506 Int64          \u2502\n\u2502 Example3 \u2506 null        \u2506 Int64          \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n\nROW DIFFERENCES:\nshape: (5, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic                \u2506 Count \u2502\n\u2502 ---                      \u2506 ---   \u2502\n\u2502 str                      \u2506 i64   \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Rows in base             \u2506 3     \u2502\n\u2502 Rows in compare          \u2506 3     \u2502\n\u2502 Rows only in base        \u2506 1     \u2502\n\u2502 Rows only in 
compare     \u2506 1     \u2502\n\u2502 Rows in base and compare \u2506 2     \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID         \u2506 variable \u2506 value           \u2502\n\u2502 ---        \u2506 ---      \u2506 ---             \u2502\n\u2502 str        \u2506 str      \u2506 str             \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678   \u2506 status   \u2506 in base only    \u2502\n\u2502 1234567810 \u2506 status   \u2506 in compare only \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n\nVALUE DIFFERENCES:\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences       \u2506 Count \u2506 Percentage \u2502\n\u2502 ---                     \u2506 ---   \u2506 ---        \u2502\n\u2502 str                 
    \u2506 i64   \u2506 f64        \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1     \u2506 50.0       \u2502\n\u2502 Example1                \u2506 1     \u2506 50.0       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID      \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 ---     \u2506 ---      \u2506 ---  \u2506 ---     \u2502\n\u2502 str     \u2506 str      \u2506 i64  \u2506 i64     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6    \u2506 2       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n\n### Specify a threshold to control the granularity of 
the comparison for numeric columns. \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...         \"Example1\": [1.111, 6.11, 3.11],\n...     }\n... )\n>>>\n>>> compare_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n...         \"Example1\": [1.114, 6.14, 3.12],\n...     },\n... )\n>>>\n>>> print(\"With equality_resolution of 0.01\")\nWith equality_resolution of 0.01\n>>> compare_result = compare([\"ID\"], base_df, compare_df, resolution=0.01)\n>>> print(compare_result.values_sample())\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID      \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 ---     \u2506 ---      \u2506 ---  \u2506 ---     \u2502\n\u2502 str     \u2506 str      \u2506 f64  \u2506 f64     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6.11 \u2506 6.14    \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"With no equality_resolution\")\nWith no equality_resolution\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(compare_result.values_sample())\nshape: (2, 
4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID      \u2506 variable \u2506 base  \u2506 compare \u2502\n\u2502 ---     \u2506 ---      \u2506 ---   \u2506 ---     \u2502\n\u2502 str     \u2506 str      \u2506 f64   \u2506 f64     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 123456  \u2506 Example1 \u2506 1.111 \u2506 1.114   \u2502\n\u2502 1234567 \u2506 Example1 \u2506 6.11  \u2506 6.14    \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n\n### Example using alias for base and compare dataframes. \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...         \"Example1\": [1, 6, 3],\n...         \"Example2\": [\"1\", \"2\", \"3\"],\n...     }\n... )\n>>> compare_df = pl.DataFrame(\n...     {\n...         \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n...         \"Example1\": [1, 2, 3],\n...         \"Example2\": [1, 2, 3],\n...         \"Example3\": [1, 2, 3],\n...     },\n... )\n>>>\n>>> compare_result = compare([\"ID\"],\n...                          base_df,\n...                          compare_df,\n...                          base_alias=\"before_change\",\n...                          
compare_alias=\"after_change\")\n>>>\n>>> print(\"schemas_sample()\")\nschemas_sample()\n>>> print(compare_result.schemas_sample())\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column   \u2506 before_change_format \u2506 after_change_format \u2502\n\u2502 ---      \u2506 ---                  \u2506 ---                 \u2502\n\u2502 str      \u2506 str                  \u2506 str                 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String               \u2506 Int64               \u2502\n\u2502 Example3 \u2506 null                 \u2506 Int64               \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"values_sample()\")\nvalues_sample()\n>>> print(compare_result.values_sample())\nshape: (1, 
4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID      \u2506 variable \u2506 before_change \u2506 after_change \u2502\n\u2502 ---     \u2506 ---      \u2506 ---           \u2506 ---          \u2502\n\u2502 str     \u2506 str      \u2506 i64           \u2506 i64          \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6             \u2506 2            \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Assert two frames are equal for a test \n\n\n```python\n>>> import polars as pl\n>>> import pytest\n>>> from pl_compare.compare import compare\n>>>\n>>> def test_example():\n...     base_df = pl.DataFrame(\n...         {\n...             \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...             \"Example1\": [1, 6, 3],\n...             \"Example2\": [1, 2, 3],\n...         }\n...     )\n...     compare_df = pl.DataFrame(\n...         {\n...             \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n...             \"Example1\": [1, 6, 9],\n...             \"Example2\": [1, 2, 3],\n...         }\n...     )\n...     comparison = compare([\"ID\"], base_df, compare_df)\n...     
if not comparison.is_equal():\n...         raise Exception(comparison.report())\n...\n>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL\nTraceback (most recent call last):\n  File \"<stdin>\", line 1, in <module>\n  File \"<stdin>\", line 18, in test_example\nException: --------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\nNo Schema differences found.\n--------------------------------------------------------------------------------\nNo Row differences found (when joining by the supplied id_columns).\n--------------------------------------------------------------------------------\n\nVALUE DIFFERENCES:\nshape: (3, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences       \u2506 Count \u2506 Percentage \u2502\n\u2502 ---                     \u2506 ---   \u2506 ---        \u2502\n\u2502 str                     \u2506 i64   \u2506 f64        \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1     \u2506 16.666667  \u2502\n\u2502 Example1                \u2506 1     \u2506 33.333333  \u2502\n\u2502 Example2                \u2506 0     \u2506 0.0        
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID       \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 ---      \u2506 ---      \u2506 ---  \u2506 ---     \u2502\n\u2502 str      \u2506 str      \u2506 i64  \u2506 i64     \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678 \u2506 Example1 \u2506 3    \u2506 9       \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n\n### To DO:\n- [x] Linting (Ruff)\n- [x] Make into python package\n- [x] Add makefile for easy linting and tests\n- [x] Statistics should indicate which statistics are referencing columns\n- [x] Add all statistics frame to tests\n- [x] Add schema differences to schema summary\n- [x] Make row examples alternate between base only and compare only so that it is more readable.\n- [x] Add limit value to the examples.\n- [x] Updated value differences summary so that Statistic is something that makes 
sense.\n- [x] Publish package to PyPI\n- [x] Add difference criterion.\n- [x] Add license\n- [x] Make package easy to use (i.e. so you only have to import pl_compare and then you can use pl_compare)\n- [x] Add table name labels that can replace 'base' and 'compare'.\n- [x] Update code to use a config dataclass that can be passed between the class and functions.\n- [x] Write up docstrings\n- [x] Write up readme (with code examples)\n- [x] Add parameter to hide column differences with 0 differences.\n- [x] Add flag to indicate if there are differences between the tables.\n- [x] Update report so that non-differences are not displayed.\n- [x] Separate out dev dependencies from library dependencies?\n- [x] Change 'threshold' to be equality resolution.\n- [x] Strict MyPy type checking\n- [x] Raise error and print examples if duplicates are present.\n- [x] Add total number of value differences to the value differences summary.\n- [x] Add percentage column to the value differences summary.\n- [x] Change id_columns to be named 'join_columns'\n- [x] GitHub Actions for publishing\n- [x] Update the duplication validation.\n- [x] Fix report output when tables are exactly equal.\n- [x] GitHub Actions for testing\n- [x] GitHub Actions for linting\n- [x] Add message when there are no columns left to be compared.\n- [x] Add message when DataFrames are exactly equal.\n- [x] Add test case with exactly equal DataFrames.\n- [x] Add test case with no columns being compared. Make sure an error is raised when the values_summary and values_sample methods are called.\n- [ ] Test for large amounts of data\n- [ ] Benchmark for different sizes of data.\n- [ ] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?\n- [ ] There still seems to be a bug when converting from LazyFrame to DataFrame using streaming (i.e. 
in the convert_to_dataframe function)\n\n## Ideas:\n- [] Simplify custom equality checks and add example.\n- [] Add a count of the number of rows that have any differences to the value differences summary.\n- [] Add a test that checks that arbitrary join conditions work.\n\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A tool to find the differences between two tables.",
    "version": "0.6.0",
    "project_urls": {
        "Changelog": "https://github.com/concur1/pl_compare/releases",
        "Repository": "https://github.com/concur1/pl_compare"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ad3a22d09d0a9a015420af9b76759af940cf12254a969b0ab2873ebe47d29abd",
                "md5": "aac8566a447524b4dab92ee55f8e33f4",
                "sha256": "1786a45b7bb7b5f87bd7a5a8f5b41ac13e25de3d01b4414161b50b63877de575"
            },
            "downloads": -1,
            "filename": "pl_compare-0.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "aac8566a447524b4dab92ee55f8e33f4",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4.0,>=3.8",
            "size": 15497,
            "upload_time": "2024-10-27T15:32:04",
            "upload_time_iso_8601": "2024-10-27T15:32:04.449904Z",
            "url": "https://files.pythonhosted.org/packages/ad/3a/22d09d0a9a015420af9b76759af940cf12254a969b0ab2873ebe47d29abd/pl_compare-0.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "743b0a8daf6bf70d7fe2fb9eaaf6e26f93c4993ba82d3c2bba18c8e309cf778d",
                "md5": "6ca4a317c22a08b9bb3f43c3ee51806c",
                "sha256": "cbcf66b3661fc7556d1ad22a9d6d3459abd2775da75f0c4f7efe3e0bc8425112"
            },
            "downloads": -1,
            "filename": "pl_compare-0.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "6ca4a317c22a08b9bb3f43c3ee51806c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.8",
            "size": 16560,
            "upload_time": "2024-10-27T15:32:05",
            "upload_time_iso_8601": "2024-10-27T15:32:05.992433Z",
            "url": "https://files.pythonhosted.org/packages/74/3b/0a8daf6bf70d7fe2fb9eaaf6e26f93c4993ba82d3c2bba18c8e309cf778d/pl_compare-0.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-10-27 15:32:05",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "concur1",
    "github_project": "pl_compare",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "pl-compare"
}
        