Name | pl-compare |
Version | 0.6.0 |
home_page | None |
Summary | A tool to find the differences between two tables. |
upload_time | 2024-10-27 15:32:05 |
maintainer | None |
docs_url | None |
author | Your Name |
requires_python | <4.0,>=3.8 |
license | None |
# pl_compare: Compare and find the differences between two Polars DataFrames.
[GitHub](https://github.com/concur1/pl_compare) - [PyPI Page](https://pypi.org/project/pl-compare/)
**You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to**:
- Understand how well two tables reconcile [example](#Full-report)
- Find the schema differences between two tables [example](#Schema-differences-summary-and-details)
- Find counts or examples of rows that exist in one table but not another [example](#Row-differences-summary-and-details)
- Find counts or examples of value differences between two tables [example](#Value-differences-summary-and-details)
- Assert that two tables are exactly equal (such as for an automated test) [example](#Assert-two-frames-are-equal-for-a-test)
- Assert that two tables have matching schemas, rows or column values [example](#Return-booleans-to-check-for-schema-row-and-value-differences)
[Click for a Jupyter notebook with example usage](https://github.com/concur1/pl_compare/blob/main/pl_compare_demo.ipynb)
![](demo1.gif)
**With pl-compare you can**:
- Get statistical summaries and/or examples and/or a boolean to indicate:
  - Schema differences
  - Row differences
  - Value differences
- Work with pandas DataFrames and other tabular data formats by converting them via Apache Arrow
- View differences as a text report
- Get differences as a Polars LazyFrame or DataFrame
- Use LazyFrames for larger-than-memory comparisons
- Specify the equality calculation that is used to determine value differences
## Installation
```zsh
pip install pl_compare
```
## Examples
### Return booleans to check for schema, row and value differences
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("is_schemas_equal:", compare_result.is_schemas_equal())
is_schemas_equal: False
>>> print("is_rows_equal:", compare_result.is_rows_equal())
is_rows_equal: False
>>> print("is_values_equal:", compare_result.is_values_equal())
is_values_equal: False
>>>
```
### Schema differences summary and details
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("schemas_summary()")
schemas_summary()
>>> print(compare_result.schemas_summary())
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema difference… ┆ 1     │
└─────────────────────────────────┴───────┘
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
>>>
```
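The counts in `schemas_summary()` can be reproduced by hand. The following is a minimal plain-Python sketch of that bookkeeping, not pl_compare's implementation; `base_schema` and `compare_schema` are hypothetical dictionaries that hard-code the column types of the two example frames.

```python
# Hypothetical stand-ins for the schemas of base_df and compare_df above.
base_schema = {"ID": "String", "Example1": "Int64", "Example2": "String"}
compare_schema = {
    "ID": "String",
    "Example1": "Int64",
    "Example2": "Int64",
    "Example3": "Int64",
}

# Columns present on one side only.
only_in_base = [c for c in base_schema if c not in compare_schema]
only_in_compare = [c for c in compare_schema if c not in base_schema]

# Shared columns whose types disagree.
shared = [c for c in base_schema if c in compare_schema]
type_mismatches = {
    c: (base_schema[c], compare_schema[c])
    for c in shared
    if base_schema[c] != compare_schema[c]
}

print(len(base_schema), len(compare_schema), len(shared))  # 3 4 3
print(only_in_base, only_in_compare)  # [] ['Example3']
print(type_mismatches)  # {'Example2': ('String', 'Int64')}
```

These values line up with the summary rows above: no columns only in base, one column only in compare, and one shared column with a type difference.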
### Row differences summary and details
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("rows_summary()")
rows_summary()
>>> print(compare_result.rows_summary())
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
>>> print("rows_sample()")
rows_sample()
>>> print(compare_result.rows_sample())
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
>>>
```
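Conceptually, the row summary is a set comparison on the join key. The sketch below illustrates that idea in plain Python, assuming rows are matched purely on the `ID` column; it is not how pl_compare computes the result internally, and the variable names are illustrative only.

```python
# Join keys (the "ID" column) of the two example frames above.
base_ids = {"123456", "1234567", "12345678"}
compare_ids = {"123456", "1234567", "1234567810"}

# Set difference gives the unmatched keys; intersection gives the matched ones.
only_in_base = sorted(base_ids - compare_ids)
only_in_compare = sorted(compare_ids - base_ids)
in_both = sorted(base_ids & compare_ids)

print(only_in_base)     # ['12345678']
print(only_in_compare)  # ['1234567810']
print(len(in_both))     # 2
```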
### Value differences summary and details
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print("values_summary()")
values_summary()
>>> print(compare_result.values_summary())
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
>>>
```
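The `Percentage` column appears to be the number of differing cells divided by the number of compared cells (shared rows times compared columns). That reading is inferred from the outputs in this README, not taken from the library's source. Since only `Example1` appears in the summary above, the plain-Python sketch below compares just that column; all names in it are illustrative.

```python
# Example1 values for the rows whose ID exists in both frames.
base_rows = {"123456": {"Example1": 1}, "1234567": {"Example1": 6}}
compare_rows = {"123456": {"Example1": 1}, "1234567": {"Example1": 2}}
compared_columns = ["Example1"]

shared_ids = sorted(base_rows.keys() & compare_rows.keys())

# One (id, column, base value, compare value) tuple per differing cell.
differences = [
    (row_id, column, base_rows[row_id][column], compare_rows[row_id][column])
    for row_id in shared_ids
    for column in compared_columns
    if base_rows[row_id][column] != compare_rows[row_id][column]
]

# Assumed denominator: every compared cell in the shared rows.
compared_cells = len(shared_ids) * len(compared_columns)
total_percentage = 100 * len(differences) / compared_cells

print(differences)       # [('1234567', 'Example1', 6, 2)]
print(total_percentage)  # 50.0
```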
### Full report
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> compare_result.report()
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
<BLANKLINE>
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema difference… ┆ 1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
<BLANKLINE>
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```
### Compare two pandas dataframes
```python
>>> import polars as pl
>>> import pandas as pd # doctest: +SKIP
>>> from pl_compare import compare
>>>
>>> base_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )  # doctest: +SKIP
>>> compare_df = pd.DataFrame(data=
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )  # doctest: +SKIP
>>>
>>> compare_result = compare(["ID"], pl.from_pandas(base_df), pl.from_pandas(compare_df))  # doctest: +SKIP
>>> compare_result.report()  # doctest: +SKIP
--------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
SCHEMA DIFFERENCES:
shape: (6, 2)
┌─────────────────────────────────┬───────┐
│ Statistic                       ┆ Count │
│ ---                             ┆ ---   │
│ str                             ┆ i64   │
╞═════════════════════════════════╪═══════╡
│ Columns in base                 ┆ 3     │
│ Columns in compare              ┆ 4     │
│ Columns in base and compare     ┆ 3     │
│ Columns only in base            ┆ 0     │
│ Columns only in compare         ┆ 1     │
│ Columns with schema differences ┆ 1     │
└─────────────────────────────────┴───────┘
shape: (2, 3)
┌──────────┬─────────────┬────────────────┐
│ column   ┆ base_format ┆ compare_format │
│ ---      ┆ ---         ┆ ---            │
│ str      ┆ str         ┆ str            │
╞══════════╪═════════════╪════════════════╡
│ Example2 ┆ String      ┆ Int64          │
│ Example3 ┆ null        ┆ Int64          │
└──────────┴─────────────┴────────────────┘
--------------------------------------------------------------------------------
ROW DIFFERENCES:
shape: (5, 2)
┌──────────────────────────┬───────┐
│ Statistic                ┆ Count │
│ ---                      ┆ ---   │
│ str                      ┆ i64   │
╞══════════════════════════╪═══════╡
│ Rows in base             ┆ 3     │
│ Rows in compare          ┆ 3     │
│ Rows only in base        ┆ 1     │
│ Rows only in compare     ┆ 1     │
│ Rows in base and compare ┆ 2     │
└──────────────────────────┴───────┘
shape: (2, 3)
┌────────────┬──────────┬─────────────────┐
│ ID         ┆ variable ┆ value           │
│ ---        ┆ ---      ┆ ---             │
│ str        ┆ str      ┆ str             │
╞════════════╪══════════╪═════════════════╡
│ 12345678   ┆ status   ┆ in base only    │
│ 1234567810 ┆ status   ┆ in compare only │
└────────────┴──────────┴─────────────────┘
--------------------------------------------------------------------------------
VALUE DIFFERENCES:
shape: (2, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 50.0       │
│ Example1                ┆ 1     ┆ 50.0       │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ i64  ┆ i64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6    ┆ 2       │
└─────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```
### Specify an equality resolution to control the granularity of the comparison for numeric columns
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1.111, 6.11, 3.11],
...     }
... )
>>>
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1.114, 6.14, 3.12],
...     },
... )
>>>
>>> print("With equality_resolution of 0.01")
With equality_resolution of 0.01
>>> compare_result = compare(["ID"], base_df, compare_df, resolution=0.01)
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬──────┬─────────┐
│ ID      ┆ variable ┆ base ┆ compare │
│ ---     ┆ ---      ┆ ---  ┆ ---     │
│ str     ┆ str      ┆ f64  ┆ f64     │
╞═════════╪══════════╪══════╪═════════╡
│ 1234567 ┆ Example1 ┆ 6.11 ┆ 6.14    │
└─────────┴──────────┴──────┴─────────┘
>>> print("With no equality_resolution")
With no equality_resolution
>>> compare_result = compare(["ID"], base_df, compare_df)
>>> print(compare_result.values_sample())
shape: (2, 4)
┌─────────┬──────────┬───────┬─────────┐
│ ID      ┆ variable ┆ base  ┆ compare │
│ ---     ┆ ---      ┆ ---   ┆ ---     │
│ str     ┆ str      ┆ f64   ┆ f64     │
╞═════════╪══════════╪═══════╪═════════╡
│ 123456  ┆ Example1 ┆ 1.111 ┆ 1.114   │
│ 1234567 ┆ Example1 ┆ 6.11  ┆ 6.14    │
└─────────┴──────────┴───────┴─────────┘
>>>
```
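The two outputs above are consistent with a rule where numeric values count as different only when their absolute difference exceeds `resolution`. The sketch below illustrates that rule in plain Python. Both the rule itself (including the strict inequality) and the helper `differs` are assumptions made for illustration, not pl_compare's actual code.

```python
def differs(base_value, compare_value, resolution=None):
    """Return True when two numbers count as different.

    Assumed rule: without a resolution, any inequality is a difference;
    with one, only gaps larger than the resolution are.
    """
    if resolution is None:
        return base_value != compare_value
    return abs(base_value - compare_value) > resolution


# Example1 values for the two IDs that exist in both frames above.
pairs = {"123456": (1.111, 1.114), "1234567": (6.11, 6.14)}

with_resolution = [k for k, (b, c) in pairs.items() if differs(b, c, 0.01)]
without_resolution = [k for k, (b, c) in pairs.items() if differs(b, c)]

print(with_resolution)     # ['1234567']
print(without_resolution)  # ['123456', '1234567']
```

The third row of each frame does not take part because its `ID` exists on one side only, so it is reported as a row difference rather than a value difference.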
### Use aliases for the base and compare DataFrames
```python
>>> import polars as pl
>>> from pl_compare import compare
>>>
>>> base_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "12345678"],
...         "Example1": [1, 6, 3],
...         "Example2": ["1", "2", "3"],
...     }
... )
>>> compare_df = pl.DataFrame(
...     {
...         "ID": ["123456", "1234567", "1234567810"],
...         "Example1": [1, 2, 3],
...         "Example2": [1, 2, 3],
...         "Example3": [1, 2, 3],
...     },
... )
>>>
>>> compare_result = compare(
...     ["ID"],
...     base_df,
...     compare_df,
...     base_alias="before_change",
...     compare_alias="after_change",
... )
>>>
>>> print("schemas_sample()")
schemas_sample()
>>> print(compare_result.schemas_sample())
shape: (2, 3)
┌──────────┬──────────────────────┬─────────────────────┐
│ column   ┆ before_change_format ┆ after_change_format │
│ ---      ┆ ---                  ┆ ---                 │
│ str      ┆ str                  ┆ str                 │
╞══════════╪══════════════════════╪═════════════════════╡
│ Example2 ┆ String               ┆ Int64               │
│ Example3 ┆ null                 ┆ Int64               │
└──────────┴──────────────────────┴─────────────────────┘
>>> print("values_sample()")
values_sample()
>>> print(compare_result.values_sample())
shape: (1, 4)
┌─────────┬──────────┬───────────────┬──────────────┐
│ ID      ┆ variable ┆ before_change ┆ after_change │
│ ---     ┆ ---      ┆ ---           ┆ ---          │
│ str     ┆ str      ┆ i64           ┆ i64          │
╞═════════╪══════════╪═══════════════╪══════════════╡
│ 1234567 ┆ Example1 ┆ 6             ┆ 2            │
└─────────┴──────────┴───────────────┴──────────────┘
>>>
```
### Assert two frames are equal for a test
```python
>>> import polars as pl
>>> import pytest
>>> from pl_compare.compare import compare
>>>
>>> def test_example():
...     base_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 3],
...             "Example2": [1, 2, 3],
...         }
...     )
...     compare_df = pl.DataFrame(
...         {
...             "ID": ["123456", "1234567", "12345678"],
...             "Example1": [1, 6, 9],
...             "Example2": [1, 2, 3],
...         }
...     )
...     comparison = compare(["ID"], base_df, compare_df)
...     if not comparison.is_equal():
...         raise Exception(comparison.report())
...
>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 18, in test_example
Exception: --------------------------------------------------------------------------------
COMPARISON REPORT
--------------------------------------------------------------------------------
No Schema differences found.
--------------------------------------------------------------------------------
No Row differences found (when joining by the supplied id_columns).
--------------------------------------------------------------------------------
VALUE DIFFERENCES:
shape: (3, 3)
┌─────────────────────────┬───────┬────────────┐
│ Value Differences       ┆ Count ┆ Percentage │
│ ---                     ┆ ---   ┆ ---        │
│ str                     ┆ i64   ┆ f64        │
╞═════════════════════════╪═══════╪════════════╡
│ Total Value Differences ┆ 1     ┆ 16.666667  │
│ Example1                ┆ 1     ┆ 33.333333  │
│ Example2                ┆ 0     ┆ 0.0        │
└─────────────────────────┴───────┴────────────┘
shape: (1, 4)
┌──────────┬──────────┬──────┬─────────┐
│ ID       ┆ variable ┆ base ┆ compare │
│ ---      ┆ ---      ┆ ---  ┆ ---     │
│ str      ┆ str      ┆ i64  ┆ i64     │
╞══════════╪══════════╪══════╪═════════╡
│ 12345678 ┆ Example1 ┆ 3    ┆ 9       │
└──────────┴──────────┴──────┴─────────┘
--------------------------------------------------------------------------------
End of Report
--------------------------------------------------------------------------------
>>>
```
### To do:
- [x] Linting (Ruff)
- [x] Make into python package
- [x] Add makefile for easy linting and tests
- [x] Statistics should indicate which statistics are referencing columns
- [x] Add all statistics frame to tests
- [x] Add schema differences to schema summary
- [x] Make row examples alternate between base only and compare only so that it is more readable.
- [x] Add limit value to the examples.
- [x] Updated value differences summary so that Statistic is something that makes sense.
- [x] Publish package to pypi
- [x] Add difference criterion.
- [x] Add license
- [x] Make package easy to use (i.e. so you only have to import pl_compare and then you can use pl_compare)
- [x] Add table name labels that can replace 'base' and 'compare'.
- [x] Update code to use a config dataclass that can be passed between the class and functions.
- [x] Write up docstrings
- [x] Write up readme (with code examples)
- [x] Add parameter to hide column differences with 0 differences.
- [x] Add flag to indicate if there are differences between the tables.
- [x] Update report so that non differences are not displayed.
- [x] Separate out dev dependencies from library dependencies?
- [x] Change 'threshold' to be equality resolution.
- [x] strict MyPy type checking
- [x] Raise error and print examples if duplicates are present.
- [x] Add total number of value differences to the value differences summary.
- [x] Add percentage column to the value differences summary.
- [x] Change id_columns to be named 'join_columns'
- [x] Github actions for publishing
- [x] Update the duplication validation.
- [x] Fix report output when tables are exactly equal.
- [x] Github actions for testing
- [x] Github actions for linting
- [x] Add message when there are no columns left to be compared.
- [x] Add message when df's are exactly equal.
- [x] Add test case with exactly equal dfs.
- [x] Add test case with no columns being compared. Make sure an error is raised when the values_summary and values_sample methods are called.
- [ ] Test for large amounts of data
- [ ] Benchmark for different sizes of data.
- [ ] Investigate use for very large datasets (50GB-100GB). Can this be done using LazyFrames only?
- [ ] There still seems to be a bug when converting from a LazyFrame to a DataFrame using streaming (i.e. in the convert_to_dataframe function)
## Ideas:
- [ ] Simplify custom equality checks and add example.
- [ ] Add a count of the number of rows that have any differences to the value differences summary.
- [ ] Add a test that checks that arbitrary join conditions work.
Raw data
{
"_id": null,
"home_page": null,
"name": "pl-compare",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Your Name",
"author_email": "you@example.com",
"download_url": "https://files.pythonhosted.org/packages/74/3b/0a8daf6bf70d7fe2fb9eaaf6e26f93c4993ba82d3c2bba18c8e309cf778d/pl_compare-0.6.0.tar.gz",
"platform": null,
"description": "# pl_compare: Compare and find the differences between two Polars DataFrames. \n\n[Github](https://github.com/concur1/pl_compare) - [PyPi Page](https://pypi.org/project/pl-compare/)\n\n**You will find pl-compare useful if you find yourself writing various SQL/Dataframe operations to**:\n- Understand how well two tables Reconcile [example](#Full-report)\n- Find the schema differences between two tables [example](#Schema-differences-summary-and-details)\n- Find counts or examples of rows that exist in one table but not another [example](#Row-differences-summary-and-details)\n- Find counts or examples of value differences between two tables [example](#Value-differences-summary-and-details)\n- Assert that two tables are exactly equal (such as for an automated test) [example](#Assert-two-frames-are-equal-for-a-test)\n- Assert that two tables have matching schemas, rows or column values [example](#Return-booleans-to-check-for-schema-row-and-value-differences)\n\n[Click for a jupyter notebook with example usage](https://github.com/concur1/pl_compare/blob/main/pl_compare_demo.ipynb)\n\n![](demo1.gif)\n\n**With pl-compare you can**:\n- Get statistical summaries and/or examples and/or a boolean to indicate:\n - Schema differences\n - Row differences\n - Value differences\n- Easily works for Pandas dataframes and other tabular data formats with conversion using Apache arrow \n- View differences as a text report\n- Get differences as a Polars LazyFrame or DataFrame\n- Use LazyFrames for larger than memory comparisons\n- Specify the equality calculation that is used to dermine value differences\n\n\n## Installation\n\n```zsh\npip install pl_compare\n```\n\n## Examples (click to expand)\n\n### Return booleans to check for schema, row and value differences \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... 
\"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... )\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(\"is_schemas_equal:\", compare_result.is_schemas_equal())\nis_schemas_equal: False\n>>> print(\"is_rows_equal:\", compare_result.is_rows_equal())\nis_rows_equal: False\n>>> print(\"is_values_equal:\", compare_result.is_values_equal())\nis_values_equal: False\n>>>\n```\n\n\n### Schema differences summary and details \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... 
)\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(\"schemas_summary()\")\nschemas_summary()\n>>> print(compare_result.schemas_summary())\nshape: (6, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Columns in base \u2506 3 \u2502\n\u2502 Columns in compare \u2506 4 \u2502\n\u2502 Columns in base and compare \u2506 3 \u2502\n\u2502 Columns only in base \u2506 0 \u2502\n\u2502 Columns only in compare \u2506 1 \u2502\n\u2502 Columns with schema difference... 
\u2506 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"schemas_sample()\")\nschemas_sample()\n>>> print(compare_result.schemas_sample())\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column \u2506 base_format \u2506 compare_format \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String \u2506 Int64 \u2502\n\u2502 Example3 \u2506 null \u2506 Int64 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Row differences summary and details \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... 
)\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(\"rows_summary()\")\nrows_summary()\n>>> print(compare_result.rows_summary())\nshape: (5, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Rows in base \u2506 3 \u2502\n\u2502 Rows in compare \u2506 3 \u2502\n\u2502 Rows only in base \u2506 1 \u2502\n\u2502 Rows only in compare \u2506 1 \u2502\n\u2502 Rows in base and compare \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"rows_sample()\")\nrows_sample()\n>>> print(compare_result.rows_sample())\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 value \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678 \u2506 status \u2506 in base only \u2502\n\u2502 1234567810 \u2506 status \u2506 in compare 
only \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Value differences summary and details \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... )\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(\"values_summary()\")\nvalues_summary()\n>>> print(compare_result.values_summary())\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences \u2506 Count \u2506 Percentage \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2506 f64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1 \u2506 50.0 \u2502\n\u2502 Example1 \u2506 1 \u2506 50.0 
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"values_sample()\")\nvalues_sample()\n>>> print(compare_result.values_sample())\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 i64 \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6 \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Full report \n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... 
)\n>>>\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> compare_result.report()\n--------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\n<BLANKLINE>\nSCHEMA DIFFERENCES:\nshape: (6, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Columns in base \u2506 3 \u2502\n\u2502 Columns in compare \u2506 4 \u2502\n\u2502 Columns in base and compare \u2506 3 \u2502\n\u2502 Columns only in base \u2506 0 \u2502\n\u2502 Columns only in compare \u2506 1 \u2502\n\u2502 Columns with schema difference... 
\u2506 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column \u2506 base_format \u2506 compare_format \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String \u2506 Int64 \u2502\n\u2502 Example3 \u2506 null \u2506 Int64 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n<BLANKLINE>\nROW DIFFERENCES:\nshape: (5, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Rows in base 
\u2506 3 \u2502\n\u2502 Rows in compare \u2506 3 \u2502\n\u2502 Rows only in base \u2506 1 \u2502\n\u2502 Rows only in compare \u2506 1 \u2502\n\u2502 Rows in base and compare \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 value \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678 \u2506 status \u2506 in base only \u2502\n\u2502 1234567810 \u2506 status \u2506 in compare only \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n<BLANKLINE>\nVALUE DIFFERENCES:\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences \u2506 Count \u2506 Percentage \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 
i64 \u2506 f64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1 \u2506 50.0 \u2502\n\u2502 Example1 \u2506 1 \u2506 50.0 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 i64 \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6 \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n### Compare two pandas dataframes \n\n\n```python\n>>> import polars as pl\n>>> import pandas as pd # doctest: +SKIP\n>>> from pl_compare import 
compare\n>>>\n>>> base_df = pd.DataFrame(data=\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )# doctest: +SKIP\n>>> compare_df = pd.DataFrame(data=\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... )# doctest: +SKIP\n>>>\n>>> compare_result = compare([\"ID\"], pl.from_pandas(base_df), pl.from_pandas(compare_df))# doctest: +SKIP\n>>> compare_result.report()# doctest: +SKIP\n--------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\n\nSCHEMA DIFFERENCES:\nshape: (6, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Columns in base \u2506 3 \u2502\n\u2502 Columns in compare \u2506 4 \u2502\n\u2502 Columns in base and compare \u2506 3 \u2502\n\u2502 Columns only in base \u2506 0 \u2502\n\u2502 Columns only in compare \u2506 1 \u2502\n\u2502 Columns with schema differences \u2506 1 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 
3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column \u2506 base_format \u2506 compare_format \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String \u2506 Int64 \u2502\n\u2502 Example3 \u2506 null \u2506 Int64 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n\nROW DIFFERENCES:\nshape: (5, 2)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Statistic \u2506 Count \u2502\n\u2502 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Rows in base \u2506 3 \u2502\n\u2502 Rows in compare \u2506 3 \u2502\n\u2502 Rows only in base \u2506 1 \u2502\n\u2502 Rows only in compare \u2506 1 \u2502\n\u2502 Rows in base and compare \u2506 2 
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 value \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678 \u2506 status \u2506 in base only \u2502\n\u2502 1234567810 \u2506 status \u2506 in compare only \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\n\nVALUE DIFFERENCES:\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences \u2506 Count \u2506 Percentage \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2506 f64 
\u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1 \u2506 50.0 \u2502\n\u2502 Example1 \u2506 1 \u2506 50.0 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 i64 \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6 \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n\n### Specify a threshold to control the granularity of the comparison for numeric columns. 
\n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1.111, 6.11, 3.11],\n... }\n... )\n>>>\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1.114, 6.14, 3.12],\n... },\n... )\n>>>\n>>> print(\"With equality_resolution of 0.01\")\nWith equality_resolution of 0.01\n>>> compare_result = compare([\"ID\"], base_df, compare_df, resolution=0.01)\n>>> print(compare_result.values_sample())\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 f64 \u2506 f64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6.11 \u2506 6.14 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"With no equality_resolution\")\nWith no equality_resolution\n>>> compare_result = compare([\"ID\"], base_df, compare_df)\n>>> print(compare_result.values_sample())\nshape: (2, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 
variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 f64 \u2506 f64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 123456 \u2506 Example1 \u2506 1.111 \u2506 1.114 \u2502\n\u2502 1234567 \u2506 Example1 \u2506 6.11 \u2506 6.14 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n\n### Example using alias for base and compare dataframes. \n\n\n```python\n>>> import polars as pl\n>>> from pl_compare import compare\n>>>\n>>> base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [\"1\", \"2\", \"3\"],\n... }\n... )\n>>> compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"1234567810\"],\n... \"Example1\": [1, 2, 3],\n... \"Example2\": [1, 2, 3],\n... \"Example3\": [1, 2, 3],\n... },\n... )\n>>>\n>>> compare_result = compare([\"ID\"],\n... base_df,\n... compare_df,\n... base_alias=\"before_change\",\n... 
compare_alias=\"after_change\")\n>>>\n>>> print(\"schemas_sample()\")\nschemas_sample()\n>>> print(compare_result.schemas_sample())\nshape: (2, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 column \u2506 before_change_format \u2506 after_change_format \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 str \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Example2 \u2506 String \u2506 Int64 \u2502\n\u2502 Example3 \u2506 null \u2506 Int64 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>> print(\"values_sample()\")\nvalues_sample()\n>>> print(compare_result.values_sample())\nshape: (1, 4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 before_change \u2506 after_change \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 
str \u2506 str \u2506 i64 \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 1234567 \u2506 Example1 \u2506 6 \u2506 2 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n>>>\n```\n\n\n### Assert two frames are equal for a test \n\n\n```python\n>>> import polars as pl\n>>> import pytest\n>>> from pl_compare.compare import compare\n>>>\n>>> def test_example():\n... base_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 3],\n... \"Example2\": [1, 2, 3],\n... }\n... )\n... compare_df = pl.DataFrame(\n... {\n... \"ID\": [\"123456\", \"1234567\", \"12345678\"],\n... \"Example1\": [1, 6, 9],\n... \"Example2\": [1, 2, 3],\n... }\n... )\n... comparison = compare([\"ID\"], base_df, compare_df)\n... if not comparison.is_equal():\n... 
raise Exception(comparison.report())\n...\n>>> test_example() # doctest: +IGNORE_EXCEPTION_DETAIL\nTraceback (most recent call last):\n File \"<stdin>\", line 1, in <module>\n File \"<stdin>\", line 18, in test_example\nException: --------------------------------------------------------------------------------\nCOMPARISON REPORT\n--------------------------------------------------------------------------------\nNo Schema differences found.\n--------------------------------------------------------------------------------\nNo Row differences found (when joining by the supplied id_columns).\n--------------------------------------------------------------------------------\n\nVALUE DIFFERENCES:\nshape: (3, 3)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 Value Differences \u2506 Count \u2506 Percentage \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 i64 \u2506 f64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 Total Value Differences \u2506 1 \u2506 16.666667 \u2502\n\u2502 Example1 \u2506 1 \u2506 33.333333 \u2502\n\u2502 Example2 \u2506 0 \u2506 0.0 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\nshape: (1, 
4)\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 ID \u2506 variable \u2506 base \u2506 compare \u2502\n\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n\u2502 str \u2506 str \u2506 i64 \u2506 i64 \u2502\n\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n\u2502 12345678 \u2506 Example1 \u2506 3 \u2506 9 \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n--------------------------------------------------------------------------------\nEnd of Report\n--------------------------------------------------------------------------------\n>>>\n```\n\n\n\n### To DO:\n- [x] Linting (Ruff)\n- [x] Make into python package\n- [x] Add makefile for easy linting and tests\n- [x] Statistics should indicate which statistics are referencing columns\n- [x] Add all statistics frame to tests\n- [x] Add schema differences to schema summary\n- [x] Make row examples alternate between base only and compare only so that it is more readable.\n- [x] Add limit value to the examples.\n- [x] Updated value differences summary so that Statistic is something that makes sense.\n- [x] Publish package to pypi\n- [x] Add difference criterion.\n- [x] Add license\n- [x] Make package easy to use (i.e. 
so you only have to import pl_compare and then you can use pl_compare)\n- [x] Add table name labels that can replace 'base' and 'compare'.\n- [x] Update code to use a config dataclass that can be passed between the class and functions.\n- [x] Write up docstrings\n- [x] Write up readme (with code examples)\n- [x] Add parameter to hide column differences with 0 differences.\n- [x] Add flag to indicate if there are differences between the tables.\n- [x] Update report so that non-differences are not displayed.\n- [x] Separate out dev dependencies from library dependencies?\n- [x] Change 'threshold' to be equality resolution.\n- [x] Strict MyPy type checking\n- [x] Raise error and print examples if duplicates are present.\n- [x] Add total number of value differences to the value differences summary.\n- [x] Add percentage column to the value differences summary.\n- [x] Change id_columns to be named 'join_columns' \n- [x] GitHub actions for publishing\n- [x] Update the duplication validation.\n- [x] Fix report output when tables are exactly equal.\n- [x] GitHub actions for testing\n- [x] GitHub actions for linting\n- [x] Add message when there are no columns left to be compared.\n- [x] Add message when dfs are exactly equal. \n- [x] Add test case with exactly equal dfs.\n- [x] Add test case with no columns being compared. Make sure an error is raised when the values_summary and values_sample methods are called.\n- [ ] Test for large amounts of data\n- [ ] Benchmark for different sizes of data.\n- [ ] Investigate use for very large datasets 50GB-100GB. Can this be done using LazyFrames only?\n- [ ] There still seems to be a bug when converting from lazy to data frame using streaming (i.e. in the convert_to_dataframe function)\n\n## Ideas:\n- [ ] Simplify custom equality checks and add example.\n- [ ] Add a count of the number of rows that have any differences to the value differences summary.\n- [ ] Add a test that checks that arbitrary join conditions work.\n\n\n",
"bugtrack_url": null,
"license": null,
"summary": "A tool to find the differences between two tables.",
"version": "0.6.0",
"project_urls": {
"Changelog": "https://github.com/concur1/pl_compare/releases",
"Repository": "https://github.com/concur1/pl_compare"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ad3a22d09d0a9a015420af9b76759af940cf12254a969b0ab2873ebe47d29abd",
"md5": "aac8566a447524b4dab92ee55f8e33f4",
"sha256": "1786a45b7bb7b5f87bd7a5a8f5b41ac13e25de3d01b4414161b50b63877de575"
},
"downloads": -1,
"filename": "pl_compare-0.6.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "aac8566a447524b4dab92ee55f8e33f4",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.8",
"size": 15497,
"upload_time": "2024-10-27T15:32:04",
"upload_time_iso_8601": "2024-10-27T15:32:04.449904Z",
"url": "https://files.pythonhosted.org/packages/ad/3a/22d09d0a9a015420af9b76759af940cf12254a969b0ab2873ebe47d29abd/pl_compare-0.6.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "743b0a8daf6bf70d7fe2fb9eaaf6e26f93c4993ba82d3c2bba18c8e309cf778d",
"md5": "6ca4a317c22a08b9bb3f43c3ee51806c",
"sha256": "cbcf66b3661fc7556d1ad22a9d6d3459abd2775da75f0c4f7efe3e0bc8425112"
},
"downloads": -1,
"filename": "pl_compare-0.6.0.tar.gz",
"has_sig": false,
"md5_digest": "6ca4a317c22a08b9bb3f43c3ee51806c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.8",
"size": 16560,
"upload_time": "2024-10-27T15:32:05",
"upload_time_iso_8601": "2024-10-27T15:32:05.992433Z",
"url": "https://files.pythonhosted.org/packages/74/3b/0a8daf6bf70d7fe2fb9eaaf6e26f93c4993ba82d3c2bba18c8e309cf778d/pl_compare-0.6.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-27 15:32:05",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "concur1",
"github_project": "pl_compare",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pl-compare"
}