
# pyspark-testframework

⏳ **Work in progress**


The goal of the `pyspark-testframework` is to provide a simple way to create tests for PySpark DataFrames. The test results are returned in DataFrame format as well.

# Tutorial

**Let's first create an example PySpark DataFrame**

The data will contain the primary keys, street names and house numbers of some addresses.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    from pyspark.sql import functions as F

    # Initialize Spark session
    spark = SparkSession.builder.appName("CreateDataFrame").getOrCreate()

    # Define the schema
    schema = StructType(
        [
            StructField("primary_key", IntegerType(), True),
            StructField("street", StringType(), True),
            StructField("house_number", IntegerType(), True),
        ]
    )

    # Define the data
    data = [
        (1, "Rochussenstraat", 27),
        (2, "Coolsingel", 31),
        (3, "%Witte de Withstraat", 27),
        (4, "Lijnbaan", -3),
        (5, None, 13),
    ]

    df = spark.createDataFrame(data, schema)

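    +-----------+--------------------+------------+
    |primary_key|street              |house_number|
    +-----------+--------------------+------------+
    |1          |Rochussenstraat     |27          |
    |2          |Coolsingel          |31          |
    |3          |%Witte de Withstraat|27          |
    |4          |Lijnbaan            |-3          |
    |5          |null                |13          |
    +-----------+--------------------+------------+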

**Import and initialize the `DataFrameTester`**

    from testframework.dataquality import DataFrameTester

    df_tester = DataFrameTester(

**Import configurable tests**

    from testframework.dataquality.tests import ValidNumericRange, RegexTest

**Initialize the `RegexTest` to test for valid street names**

    valid_street_name = RegexTest(
        pattern=r"^[A-Z][a-zéèáàëï]*([ -][A-Z]?[a-zéèáàëï]*)*$",

**Run `valid_street_name` on the _street_ column using the `.test()` method of `DataFrameTester`.**

    nullable=False,  # not nullable, hence null values are converted to False
    description="street contains valid Dutch street name.",

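    +-----------+--------------------+-----------------------+
    |primary_key|street              |street__ValidStreetName|
    +-----------+--------------------+-----------------------+
    |1          |Rochussenstraat     |true                   |
    |2          |Coolsingel          |true                   |
    |3          |%Witte de Withstraat|false                  |
    |4          |Lijnbaan            |true                   |
    |5          |null                |false                  |
    +-----------+--------------------+-----------------------+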
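
The effect of `nullable=False` can be illustrated without Spark. The snippet below is a minimal plain-Python sketch, not the library's implementation: `regex_test` is a hypothetical helper that applies the same pattern with the standard `re` module and lets a missing value pass only when `nullable` is true.

```python
import re

# Same pattern as used for the street name test in this tutorial.
PATTERN = re.compile(r"^[A-Z][a-zéèáàëï]*([ -][A-Z]?[a-zéèáàëï]*)*$")


def regex_test(value, nullable):
    """Hypothetical helper: True if the value matches, False otherwise."""
    if value is None:
        # A missing value only passes when the column may contain nulls.
        return nullable
    return PATTERN.match(value) is not None


streets = ["Rochussenstraat", "Coolsingel", "%Witte de Withstraat", "Lijnbaan", None]
street_results = [regex_test(street, nullable=False) for street in streets]
print(street_results)  # [True, True, False, True, False]
```

The leading `%` makes the third street fail, and the missing street in row 5 fails because the column is not nullable.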

**Run the `ValidNumericRange` test on the _house_number_ column**

    nullable=True,  # nullable, hence null values are converted to True
    # description is optional, let's not define it for illustration purposes

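    +-----------+------------+-------------------------------+
    |primary_key|house_number|house_number__ValidNumericRange|
    +-----------+------------+-------------------------------+
    |          1|          27|                           true|
    |          2|          31|                           true|
    |          3|          27|                           true|
    |          4|          -3|                          false|
    |          5|          13|                           true|
    +-----------+------------+-------------------------------+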
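
The same check can be sketched in plain Python. This is an illustration only: `valid_numeric_range` is a hypothetical helper, the lower bound of 0 is taken from the generated description `house_number__ValidNumericRange(min_value=0.0, max_value=inf)`, and treating both bounds as inclusive is an assumption.

```python
import math


def valid_numeric_range(value, min_value=-math.inf, max_value=math.inf, nullable=False):
    """Hypothetical helper: True if value lies within [min_value, max_value]."""
    if value is None:
        # With nullable=True a missing value counts as valid.
        return nullable
    # Assumption: both bounds are inclusive.
    return min_value <= value <= max_value


house_numbers = [27, 31, 27, -3, 13]
range_results = [
    valid_numeric_range(number, min_value=0, nullable=True) for number in house_numbers
]
print(range_results)  # [True, True, True, False, True]
```

Only the negative house number in row 4 falls outside the range.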

**Let's take a look at the test results of the DataFrame using the `.results` attribute.**


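    +-----------+-----------------------+-------------------------------+
    |primary_key|street__ValidStreetName|house_number__ValidNumericRange|
    +-----------+-----------------------+-------------------------------+
    |1          |true                   |true                           |
    |2          |true                   |true                           |
    |3          |false                  |true                           |
    |4          |true                   |false                          |
    |5          |false                  |true                           |
    +-----------+-----------------------+-------------------------------+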

**We can use `.descriptions` or `.descriptions_df` to get the descriptions of the tests.**

This can be useful for reporting purposes, for example to create reports for the business with more detailed information than just the column name and the test name.


    {'street__ValidStreetName': 'street contains valid Dutch street name.',
     'house_number__ValidNumericRange': 'house_number__ValidNumericRange(min_value=0.0, max_value=inf)'}


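    +-------------------------------+-------------------------------------------------------------+
    |test                           |description                                                  |
    +-------------------------------+-------------------------------------------------------------+
    |street__ValidStreetName        |street contains valid Dutch street name.                     |
    |house_number__ValidNumericRange|house_number__ValidNumericRange(min_value=0.0, max_value=inf)|
    +-------------------------------+-------------------------------------------------------------+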
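
To show how the descriptions could feed a report, here is a small plain-Python sketch. It is not part of the framework: `failed_checks` is a hypothetical helper, and the two dictionaries are copied by hand from the results and descriptions shown above.

```python
# Test results per primary key, copied from the results table.
results = {
    1: {"street__ValidStreetName": True, "house_number__ValidNumericRange": True},
    2: {"street__ValidStreetName": True, "house_number__ValidNumericRange": True},
    3: {"street__ValidStreetName": False, "house_number__ValidNumericRange": True},
    4: {"street__ValidStreetName": True, "house_number__ValidNumericRange": False},
    5: {"street__ValidStreetName": False, "house_number__ValidNumericRange": True},
}

# Descriptions per test, copied from the descriptions output.
descriptions = {
    "street__ValidStreetName": "street contains valid Dutch street name.",
    "house_number__ValidNumericRange": "house_number__ValidNumericRange(min_value=0.0, max_value=inf)",
}


def failed_checks(results, descriptions):
    """Hypothetical helper: list (primary_key, test, description) for each failure."""
    report = []
    for key, tests in results.items():
        for test_name, passed in tests.items():
            if passed is False:
                report.append((key, test_name, descriptions[test_name]))
    return report


report = failed_checks(results, descriptions)
for row in report:
    print(row)
```

Each printed row names the failing record, the test, and its human-readable description.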

### Custom tests

Sometimes tests are too specific or complex to be covered by the configurable tests. That's why we can create custom tests and add them to the `DataFrameTester` object.

Let's do this using a custom test which checks that every house has a bath room. We'll start by creating a new DataFrame with rooms rather than houses.

    rooms = [
        (1, "living room"),
        (1, "bath room"),
        (1, "kitchen"),
        (1, "bed room"),
        (2, "living room"),
        (2, "bed room"),
        (2, "kitchen"),
    ]

    schema_rooms = StructType(
        [
            StructField("primary_key", IntegerType(), True),
            StructField("room", StringType(), True),
        ]
    )

    room_df = spark.createDataFrame(rooms, schema=schema_rooms)

    +-----------+-----------+
    |primary_key|room       |
    +-----------+-----------+
    |1          |living room|
    |1          |bath room  |
    |1          |kitchen    |
    |1          |bed room   |
    |2          |living room|
    |2          |bed room   |
    |2          |kitchen    |
    +-----------+-----------+

To create a custom test, we should create a PySpark DataFrame which contains the same `primary_key` column as the DataFrame that is tested with the `DataFrameTester`.

Let's create a boolean column that indicates whether the house has a bath room or not.

    house_has_bath_room = room_df.groupBy("primary_key").agg(
        F.max(F.when(F.col("room") == "bath room", 1).otherwise(0)).alias("has_bath_room")
    )

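    +-----------+-------------+
    |primary_key|has_bath_room|
    +-----------+-------------+
    |1          |1            |
    |2          |0            |
    +-----------+-------------+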
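
The aggregation flags each room with 1 or 0 and keeps the maximum per house, so a house gets 1 as soon as one of its rooms is a bath room. A plain-Python sketch of the same idea, using the room data from above:

```python
rooms = [
    (1, "living room"),
    (1, "bath room"),
    (1, "kitchen"),
    (1, "bed room"),
    (2, "living room"),
    (2, "bed room"),
    (2, "kitchen"),
]

has_bath_room = {}
for primary_key, room in rooms:
    # 1 if this room is a bath room, otherwise 0 (the when/otherwise step).
    flag = 1 if room == "bath room" else 0
    # Keep the highest flag seen so far for this house (the max step).
    has_bath_room[primary_key] = max(has_bath_room.get(primary_key, 0), flag)

print(has_bath_room)  # {1: 1, 2: 0}
```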

**We can add this 'custom test' to the `DataFrameTester` using `add_custom_test_result`.**

Behind the scenes, `DataFrameTester` runs several data validation checks to make sure that the custom result meets the requirements for being added to the other test results.

    description="House has a bath room",
    # fillna_value=0, # optional; by default null.

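    +-----------+-------------+
    |primary_key|has_bath_room|
    +-----------+-------------+
    |1          |1            |
    |2          |0            |
    |3          |null         |
    |4          |null         |
    |5          |null         |
    +-----------+-------------+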
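
The output above looks like a left join of the custom result onto the primary keys of the tested DataFrame. The sketch below mimics that in plain Python. It is an assumption about the behaviour, not the library's code: `align_custom_result` is a hypothetical helper, and `fillna_value` is assumed to replace missing results, as the commented-out argument suggests.

```python
def align_custom_result(primary_keys, custom_result, fillna_value=None):
    """Hypothetical helper: one result per primary key, missing keys get fillna_value."""
    return {key: custom_result.get(key, fillna_value) for key in primary_keys}


house_keys = [1, 2, 3, 4, 5]    # primary keys of the house DataFrame
custom_result = {1: 1, 2: 0}    # has_bath_room per house, from the rooms data

aligned = align_custom_result(house_keys, custom_result)
print(aligned)  # {1: 1, 2: 0, 3: None, 4: None, 5: None}

filled = align_custom_result(house_keys, custom_result, fillna_value=0)
print(filled)  # {1: 1, 2: 0, 3: 0, 4: 0, 5: 0}
```

Leaving the missing values as null keeps "no data" distinguishable from "no bath room", which a fill value of 0 would hide.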

**Although the house DataFrame does not contain data on whether a house has a bath room, we can still add the custom test to the `DataFrameTester` object.**


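    +-----------+-----------------------+-------------------------------+-------------+
    |primary_key|street__ValidStreetName|house_number__ValidNumericRange|has_bath_room|
    +-----------+-----------------------+-------------------------------+-------------+
    |1          |true                   |true                           |1            |
    |2          |true                   |true                           |0            |
    |3          |false                  |true                           |null         |
    |4          |true                   |false                          |null         |
    |5          |false                  |true                           |null         |
    +-----------+-----------------------+-------------------------------+-------------+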


    {'street__ValidStreetName': 'street contains valid Dutch street name.',
     'house_number__ValidNumericRange': 'house_number__ValidNumericRange(min_value=0.0, max_value=inf)',
     'has_bath_room': 'House has a bath room'}


