chispa


Name: chispa
Version: 0.10.1
Home page: https://github.com/MrPowers/chispa
Summary: PySpark test helper library
Upload time: 2024-07-31 21:06:41
Maintainer: Semyon Sinchenko
Author: Matthew Powers
Requires Python: <4.0,>=3.8
License: MIT
Keywords: apachespark, spark, pyspark, pytest
# chispa

![![image](https://github.com/MrPowers/chispa/workflows/build/badge.svg)](https://github.com/MrPowers/chispa/actions/workflows/ci.yml/badge.svg)
![PyPI - Downloads](https://img.shields.io/pypi/dm/chispa)
[![PyPI version](https://badge.fury.io/py/chispa.svg)](https://badge.fury.io/py/chispa)

chispa provides fast PySpark test helper methods that output descriptive error messages.

This library makes it easy to write high quality PySpark code.

Fun fact: "chispa" means Spark in Spanish ;)

## Installation

Install the latest version with `pip install chispa`.

If you use Poetry, add this library as a development dependency with `poetry add chispa -G dev`.

## Column equality

Suppose you have a function that removes the non-word characters in a string.

```python
import pyspark.sql.functions as F

def remove_non_word_characters(col):
    return F.regexp_replace(col, "[^\\w\\s]+", "")
```

Create a `SparkSession` so you can create DataFrames.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
  .master("local")
  .appName("chispa")
  .getOrCreate())
```

Create a DataFrame with a column that contains strings with non-word characters, run the `remove_non_word_characters` function, and check that all these characters are removed with the chispa `assert_column_equality` method.

```python
import pytest

from chispa.column_comparer import assert_column_equality
import pyspark.sql.functions as F

def test_remove_non_word_characters_short():
    data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")
```

Let's write another test that fails, so you can see how the descriptive error message helps you debug the underlying issue.

Here's the failing test:

```python
def test_remove_non_word_characters_nice_error():
    data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    df = (spark.createDataFrame(data, ["name", "expected_name"])
        .withColumn("clean_name", remove_non_word_characters(F.col("name"))))
    assert_column_equality(df, "clean_name", "expected_name")
```

Here's the nicely formatted error message:

![ColumnsNotEqualError](https://github.com/MrPowers/chispa/blob/main/images/columns_not_equal_error.png)

You can see the `matt7` / `matt` row of data is what's causing the error (note it's highlighted in red).  The other rows are colored blue because they're equal.
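To see why only the `matt7` row fails, it helps to apply the same pattern outside Spark. The sketch below uses Python's `re` module as a stand-in for Spark's `regexp_replace` (an assumption made for illustration; the helper name `remove_non_word_characters_py` is hypothetical). For these ASCII inputs, `\w` matches letters, digits, and underscores, so the digit `7` is a word character and is not removed.

```python
import re

# Plain-Python stand-in for the Spark column expression (illustration only).
def remove_non_word_characters_py(value):
    if value is None:
        return None
    # Same pattern as the Spark version: drop anything that is neither
    # a word character (\w) nor whitespace (\s).
    return re.sub(r"[^\w\s]+", "", value)

names = ["matt7", "bill&", "isabela*", None]
cleaned = [remove_non_word_characters_py(name) for name in names]
print(cleaned)  # ['matt7', 'bill', 'isabela', None]
```

The function behaves as written; it is the expected value `matt` in the test data that does not match what the pattern produces.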

## DataFrame equality

We can also test the `remove_non_word_characters` function by creating two DataFrames and verifying that they're equal.

Creating two DataFrames is slower and requires more code, but comparing entire DataFrames is necessary for some tests.

```python
from chispa.dataframe_comparer import assert_df_equality

def test_remove_non_word_characters_long():
    source_data = [
        ("jo&&se",),
        ("**li**",),
        ("#::luisa",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("jo&&se", "jose"),
        ("**li**", "li"),
        ("#::luisa", "luisa"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)
```

Let's write another test that raises an error, so you can see the descriptive error message.

```python
def test_remove_non_word_characters_long_error():
    source_data = [
        ("matt7",),
        ("bill&",),
        ("isabela*",),
        (None,)
    ]
    source_df = spark.createDataFrame(source_data, ["name"])

    actual_df = source_df.withColumn(
        "clean_name",
        remove_non_word_characters(F.col("name"))
    )

    expected_data = [
        ("matt7", "matt"),
        ("bill&", "bill"),
        ("isabela*", "isabela"),
        (None, None)
    ]
    expected_df = spark.createDataFrame(expected_data, ["name", "clean_name"])

    assert_df_equality(actual_df, expected_df)
```

Here's the nicely formatted error message:

![DataFramesNotEqualError](https://github.com/MrPowers/chispa/blob/main/images/dfs_not_equal_error.png)

### Ignore row order

You can easily compare DataFrames, ignoring the order of the rows.  The content of the DataFrames is usually what matters, not the order of the rows.

Here are the contents of `df1`:

```
+--------+
|some_num|
+--------+
|       1|
|       2|
|       3|
+--------+
```

Here are the contents of `df2`:

```
+--------+
|some_num|
+--------+
|       2|
|       1|
|       3|
+--------+
```

Here's how to confirm `df1` and `df2` are equal when the row order is ignored.

```python
assert_df_equality(df1, df2, ignore_row_order=True)
```

If you don't set `ignore_row_order=True`, the test errors out with this message:

![ignore_row_order_false](https://github.com/MrPowers/chispa/blob/main/images/ignore_row_order_false.png)

The rows aren't ordered by default because sorting slows down the function.
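The idea behind an order-insensitive comparison can be sketched in plain Python on collected rows. This is a minimal illustration that assumes rows are tuples of sortable, non-null values; `rows_equal` is a hypothetical helper, not chispa's implementation.

```python
# Hypothetical helper: compare two lists of row tuples, optionally
# sorting both sides first so that row order no longer matters.
def rows_equal(rows1, rows2, ignore_row_order=False):
    if ignore_row_order:
        rows1, rows2 = sorted(rows1), sorted(rows2)
    return list(rows1) == list(rows2)

df1_rows = [(1,), (2,), (3,)]
df2_rows = [(2,), (1,), (3,)]

strict_result = rows_equal(df1_rows, df2_rows)
relaxed_result = rows_equal(df1_rows, df2_rows, ignore_row_order=True)
print(strict_result, relaxed_result)  # False True
```

The extra sort is the cost that the default avoids.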

### Ignore column order

This section explains how to compare DataFrames, ignoring the order of the columns.

Suppose you have the following `df1`:

```
+----+----+
|num1|num2|
+----+----+
|   1|   7|
|   2|   8|
|   3|   9|
+----+----+
```

Here are the contents of `df2`:

```
+----+----+
|num2|num1|
+----+----+
|   7|   1|
|   8|   2|
|   9|   3|
+----+----+
```

Here's how to compare the equality of `df1` and `df2`, ignoring the column order:

```python
assert_df_equality(df1, df2, ignore_column_order=True)
```

Here's the error message you'll see if you run `assert_df_equality(df1, df2)`, without ignoring the column order.

![ignore_column_order_false](https://github.com/MrPowers/chispa/blob/main/images/ignore_column_order_false.png)
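One way to picture a column-order-insensitive comparison is to put both sides into a canonical column order before comparing. The sketch below does this in plain Python by sorting on column name; `sort_columns` is a hypothetical helper and only illustrates the idea, not how chispa does it internally.

```python
# Hypothetical helper: reorder column names and row values so that
# columns appear in alphabetical order.
def sort_columns(columns, rows):
    order = sorted(range(len(columns)), key=lambda i: columns[i])
    sorted_columns = [columns[i] for i in order]
    sorted_rows = [tuple(row[i] for i in order) for row in rows]
    return sorted_columns, sorted_rows

df1 = (["num1", "num2"], [(1, 7), (2, 8), (3, 9)])
df2 = (["num2", "num1"], [(7, 1), (8, 2), (9, 3)])

same_as_is = df1 == df2
same_after_sorting = sort_columns(*df1) == sort_columns(*df2)
print(same_as_is, same_after_sorting)  # False True
```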

### Ignore nullability

Each column in a schema has three properties: a name, data type, and nullable property.  The column can accept null values if `nullable` is set to true.

You'll sometimes want to ignore the nullable property when making DataFrame comparisons.

Suppose you have the following `df1`:

```
+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+
```

And this `df2`:

```
+-----+---+
| name|age|
+-----+---+
| juan|  7|
|bruna|  8|
+-----+---+
```

You might be surprised to find that in this example, `df1` and `df2` are not equal and will error out with this message:

![nullable_off_error](https://github.com/MrPowers/chispa/blob/main/images/nullable_off_error.png)

Examine the code in this contrived example to better understand the error:

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

def ignore_nullable_property():
    s1 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), True)])
    df1 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s1)
    s2 = StructType([
       StructField("name", StringType(), True),
       StructField("age", IntegerType(), False)])
    df2 = spark.createDataFrame([("juan", 7), ("bruna", 8)], s2)
    assert_df_equality(df1, df2)
```

You can ignore the nullable property when assessing equality by adding a flag:

```python
assert_df_equality(df1, df2, ignore_nullable=True)
```

Elements contained within an `ArrayType()` also have a nullable property, in addition to the nullable property of the column schema. These are also ignored when passing `ignore_nullable=True`.

Again, examine the following code to understand the error that `ignore_nullable=True` bypasses:

```python
from pyspark.sql.types import ArrayType, DoubleType, StringType, StructField, StructType

def ignore_nullable_property_array():
    s1 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), True), True),])
    df1 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s1)
    s2 = StructType([
        StructField("name", StringType(), True),
        StructField("coords", ArrayType(DoubleType(), False), True),])
    df2 = spark.createDataFrame([("juan", [1.42, 3.5]), ("bruna", [2.76, 3.2])], s2)
    assert_df_equality(df1, df2)
```
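The effect of the flag on column-level nullability can be sketched without Spark. In this illustration a schema is a list of `(name, data_type, nullable)` records; the `Field` type and `schemas_equal` helper are hypothetical names, and the sketch does not cover nested types such as arrays.

```python
from typing import NamedTuple

# Hypothetical stand-in for a schema field: name, type name, nullable flag.
class Field(NamedTuple):
    name: str
    data_type: str
    nullable: bool

def schemas_equal(schema1, schema2, ignore_nullable=False):
    if len(schema1) != len(schema2):
        return False
    for f1, f2 in zip(schema1, schema2):
        # Names and data types must always match.
        if f1.name != f2.name or f1.data_type != f2.data_type:
            return False
        # The nullable flag is compared only when it is not being ignored.
        if not ignore_nullable and f1.nullable != f2.nullable:
            return False
    return True

s1 = [Field("name", "string", True), Field("age", "int", True)]
s2 = [Field("name", "string", True), Field("age", "int", False)]

strict_match = schemas_equal(s1, s2)
relaxed_match = schemas_equal(s1, s2, ignore_nullable=True)
print(strict_match, relaxed_match)  # False True
```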

### Allow NaN equality

Python has NaN (not a number) values and two NaN values are not considered equal by default.  Create two NaN values, compare them, and confirm they're not considered equal by default.

```python
nan1 = float('nan')
nan2 = float('nan')
nan1 == nan2 # False
```

pandas' testing helpers treat NaN values in matching positions as equal by default, but chispa requires you to set a flag to treat two NaN values as equal.

```python
assert_df_equality(df1, df2, allow_nan_equality=True)
```
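A NaN-aware comparison of individual values can be sketched in plain Python as follows. `cells_equal` is a hypothetical helper written for illustration; it shows the general technique of checking for NaN explicitly, not chispa's own code.

```python
import math

# Hypothetical helper: equality check that can optionally treat
# two NaN floats as equal.
def cells_equal(x, y, allow_nan_equality=False):
    if (
        allow_nan_equality
        and isinstance(x, float)
        and isinstance(y, float)
        and math.isnan(x)
        and math.isnan(y)
    ):
        return True
    return x == y

nan1 = float("nan")
nan2 = float("nan")

default_result = cells_equal(nan1, nan2)
flagged_result = cells_equal(nan1, nan2, allow_nan_equality=True)
print(default_result, flagged_result)  # False True
```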

## Customize formatting

You can specify custom formats for the printed error messages as follows:

```python
from chispa import FormattingConfig

formats = FormattingConfig(
        mismatched_rows={"color": "light_yellow"},
        matched_rows={"color": "cyan", "style": "bold"},
        mismatched_cells={"color": "purple"},
        matched_cells={"color": "blue"},
    )

assert_basic_rows_equality(df1.collect(), df2.collect(), formats=formats)
```

or similarly:

```python
from chispa import FormattingConfig, Color, Style

formats = FormattingConfig(
        mismatched_rows={"color": Color.LIGHT_YELLOW},
        matched_rows={"color": Color.CYAN, "style": Style.BOLD},
        mismatched_cells={"color": Color.PURPLE},
        matched_cells={"color": Color.BLUE},
    )

assert_basic_rows_equality(df1.collect(), df2.collect(), formats=formats)
```

You can also define these formats in `conftest.py` and inject them via a fixture:

```python
@pytest.fixture()
def chispa_formats():
    return FormattingConfig(
        mismatched_rows={"color": "light_yellow"},
        matched_rows={"color": "cyan", "style": "bold"},
        mismatched_cells={"color": "purple"},
        matched_cells={"color": "blue"},
    )

def test_shows_assert_basic_rows_equality(chispa_formats):
    ...
    assert_basic_rows_equality(df1.collect(), df2.collect(), formats=chispa_formats)
```

![custom_formats](https://github.com/MrPowers/chispa/blob/main/images/custom_formats.png)

## Approximate column equality

We can check whether columns are approximately equal, which is especially useful for floating-point comparisons.

Here's a test that creates a DataFrame with two floating point columns and verifies that the columns are approximately equal.  In this example, values are considered approximately equal if the difference is less than 0.1.

```python
def test_approx_col_equality_same():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 3.37),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)
```

Here's an example of a test with columns that are not approximately equal.

```python
def test_approx_col_equality_different():
    data = [
        (1.1, 1.1),
        (2.2, 2.15),
        (3.3, 5.0),
        (None, None)
    ]
    df = spark.createDataFrame(data, ["num1", "num2"])
    assert_approx_column_equality(df, "num1", "num2", 0.1)
```

This failing test will output a readable error message so the issue is easy to debug.

![ColumnsNotEqualError](https://github.com/MrPowers/chispa/blob/main/images/columns_not_approx_equal.png)
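The tolerance rule described above ("the difference is less than 0.1") can be sketched in plain Python. `approx_equal` is a hypothetical helper; treating a pair of nulls as equal is an assumption drawn from the passing test, which includes a `(None, None)` row.

```python
# Hypothetical helper: two values are approximately equal when the
# absolute difference is strictly below the given precision.
def approx_equal(x, y, precision):
    if x is None and y is None:
        return True   # assumption: a pair of nulls counts as equal
    if x is None or y is None:
        return False  # one-sided null never matches
    return abs(x - y) < precision

same = [(1.1, 1.1), (2.2, 2.15), (3.3, 3.37), (None, None)]
different = [(1.1, 1.1), (2.2, 2.15), (3.3, 5.0), (None, None)]

all_same = all(approx_equal(a, b, 0.1) for a, b in same)
mismatches = [(a, b) for a, b in different if not approx_equal(a, b, 0.1)]
print(all_same, mismatches)  # True [(3.3, 5.0)]
```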

## Approximate DataFrame equality

Let's create two DataFrames and confirm they're approximately equal.

```python
def test_approx_df_equality_same():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.05, "a"),
        (2.13, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)
```

The `assert_approx_df_equality` method is smart and will only perform approximate equality operations for floating point numbers in DataFrames.  It'll perform regular equality for strings and other types.

Let's perform an approximate equality comparison for two DataFrames that are not equal.

```python
def test_approx_df_equality_different():
    data1 = [
        (1.1, "a"),
        (2.2, "b"),
        (3.3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1.1, "a"),
        (5.0, "b"),
        (3.3, "z"),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "letter"])

    assert_approx_df_equality(df1, df2, 0.1)
```

Here's the resulting error message:

![DataFramesNotEqualError](https://github.com/MrPowers/chispa/blob/main/images/dfs_not_approx_equal.png)
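The type-dependent behavior described in this section, approximate comparison for floats and exact comparison for everything else, can be sketched on collected rows in plain Python. The helper names below are hypothetical and the code is an illustration of the idea rather than chispa's implementation.

```python
# Hypothetical helper: floats are compared with a tolerance,
# every other type (strings, None, ...) with regular equality.
def cells_approx_equal(x, y, precision):
    if isinstance(x, float) and isinstance(y, float):
        return abs(x - y) < precision
    return x == y

def rows_approx_equal(row1, row2, precision):
    return len(row1) == len(row2) and all(
        cells_approx_equal(x, y, precision) for x, y in zip(row1, row2)
    )

rows1 = [(1.1, "a"), (2.2, "b"), (3.3, "c"), (None, None)]
rows_close = [(1.05, "a"), (2.13, "b"), (3.3, "c"), (None, None)]
rows_far = [(1.1, "a"), (5.0, "b"), (3.3, "z"), (None, None)]

close_ok = all(rows_approx_equal(a, b, 0.1) for a, b in zip(rows1, rows_close))
bad_rows = [
    i for i, (a, b) in enumerate(zip(rows1, rows_far))
    if not rows_approx_equal(a, b, 0.1)
]
print(close_ok, bad_rows)  # True [1, 2]
```

Row 1 fails on the float column (`2.2` vs `5.0`), while row 2 fails on the string column (`"c"` vs `"z"`), which no tolerance can rescue.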

## Schema mismatch messages

DataFrame equality assertions perform schema comparisons before analyzing the actual content of the DataFrames.  DataFrames that don't have the same schemas should error out as fast as possible.

Let's compare a DataFrame that has an integer column and a string column with a DataFrame that has two integer columns to observe the schema mismatch message.

```python
def test_schema_mismatch_message():
    data1 = [
        (1, "a"),
        (2, "b"),
        (3, "c"),
        (None, None)
    ]
    df1 = spark.createDataFrame(data1, ["num", "letter"])

    data2 = [
        (1, 6),
        (2, 7),
        (3, 8),
        (None, None)
    ]
    df2 = spark.createDataFrame(data2, ["num", "num2"])

    assert_df_equality(df1, df2)
```

Here's the error message:

![SchemasNotEqualError](https://github.com/MrPowers/chispa/blob/main/images/schemas_not_approx_equal.png)
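The "schema first, content second" ordering can be sketched in plain Python. Everything here is hypothetical and for illustration only: the exception classes, the `assert_equal_sketch` function, and the type names used in the schemas are not taken from chispa.

```python
# Hypothetical exception types for the two failure modes.
class SchemasNotEqualSketch(Exception):
    pass

class RowsNotEqualSketch(Exception):
    pass

def assert_equal_sketch(schema1, rows1, schema2, rows2):
    # Cheap check first: a schema mismatch fails before any row is inspected.
    if schema1 != schema2:
        raise SchemasNotEqualSketch(f"{schema1} != {schema2}")
    if rows1 != rows2:
        raise RowsNotEqualSketch("row contents differ")

# Type names are illustrative labels, not inferred by Spark here.
schema1 = [("num", "integer"), ("letter", "string")]
rows1 = [(1, "a"), (2, "b"), (3, "c"), (None, None)]
schema2 = [("num", "integer"), ("num2", "integer")]
rows2 = [(1, 6), (2, 7), (3, 8), (None, None)]

try:
    assert_equal_sketch(schema1, rows1, schema2, rows2)
    raised = None
except (SchemasNotEqualSketch, RowsNotEqualSketch) as error:
    raised = type(error).__name__

print(raised)  # SchemasNotEqualSketch
```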

## Supported PySpark / Python versions

chispa currently supports PySpark 2.4+ and Python 3.8+.

Use chispa v0.8.2 if you're using an older Python version.

PySpark 2 support will be dropped when chispa 1.x is released.

## Benchmarks

TODO: Need to benchmark these methods vs. the spark-testing-base ones

## Developing chispa on your local machine

You are encouraged to clone and/or fork this repo.

This project uses [Poetry](https://python-poetry.org/) for packaging and dependency management.

* Set up the virtual environment with `poetry install`
* Run the tests with `poetry run pytest tests`

Studying the codebase is a great way to learn about PySpark!

## Contributing

Anyone is encouraged to submit a pull request, open an issue, or submit a bug report.

We're happy to promote folks to be library maintainers if they make good contributions.
