spark-expectations


| Field | Value |
| --- | --- |
| Name | spark-expectations |
| Version | 2.2.1 |
| home_page | None |
| Summary | This project helps us to run Data Quality Rules in flight while spark job is being run |
| upload_time | 2024-11-11 18:11:09 |
| maintainer | None |
| docs_url | None |
| author | Ashok Singamaneni |
| requires_python | <3.13,>=3.9 |
| license | None |
| requirements | No requirements were recorded. |

# Spark-Expectations

[![CodeQL](https://github.com/Nike-Inc/spark-expectations/actions/workflows/codeql-analysis.yml/badge.svg)](https://github.com/Nike-Inc/spark-expectations/actions/workflows/codeql-analysis.yml)
[![build](https://github.com/Nike-Inc/spark-expectations/actions/workflows/onpush.yml/badge.svg)](https://github.com/Nike-Inc/spark-expectations/actions/workflows/onpush.yml)
[![codecov](https://codecov.io/gh/Nike-Inc/spark-expectations/branch/main/graph/badge.svg)](https://codecov.io/gh/Nike-Inc/spark-expectations)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Checked with mypy](http://www.mypy-lang.org/static/mypy_badge.svg)](http://mypy-lang.org/)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
![PYPI version](https://img.shields.io/pypi/v/spark-expectations.svg)
![PYPI - Downloads](https://static.pepy.tech/badge/spark-expectations)
![PYPI - Python Version](https://img.shields.io/pypi/pyversions/spark-expectations.svg)

<p align="center">
Spark Expectations is a specialized tool designed with the primary goal of maintaining data integrity within your processing pipeline.
By identifying and preventing malformed or incorrect data from reaching the target destination, it ensures that only quality data is
passed through. Any erroneous records are not simply ignored but are filtered into a separate error table, allowing for 
detailed analysis and reporting. Additionally, Spark Expectations provides valuable statistical data on the filtered content, 
empowering you with insights into your data quality.
</p>

<p align="center">
<img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/logo.png?raw=true width="400" height="400"></p>

---

The documentation for spark-expectations can be found [here](https://engineering.nike.com/spark-expectations/).

### Contributors

Thanks to all the [contributors](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTORS.md) who have helped ideate, develop, and bring the project to its current state.

### Contributing

We're delighted that you're interested in contributing to our project! To get started, 
please carefully read and follow the guidelines provided in our [contributing](https://github.com/Nike-Inc/spark-expectations/blob/main/CONTRIBUTING.md) document.

# What is Spark Expectations?
#### Spark Expectations is a data quality framework built in PySpark as a solution for the following problem statements:

1. Existing data quality tools validate the data in a table at rest and report success and error metrics. Users must then check the metrics manually to identify the error records.
2. The error data is not quarantined in an error table, and no corrective action is taken to send only valid data downstream.
3. Downstream users must either consume the same incorrect data or perform additional calculations to eliminate records that don't comply with the data quality rules.
4. A separate process is required as a corrective action to rectify the errors in the data, and this activity usually requires a lot of planning.

#### Spark Expectations solves these issues using the following principles:

1. All records that fail one or more data quality rules are, by default, quarantined in an `_error` table along with metadata on the rules that failed, job information, etc. This makes it easier for analysts or product teams to view the incorrect data and collaborate with the teams responsible for correcting and reprocessing it.
2. Aggregated metrics are provided for the raw data and the cleansed data for each run, along with the required metadata, so that they do not need to be recalculated.
3. Data that doesn't meet the data quality contract or standards is not moved to the next level or iteration unless otherwise specified.
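
The quarantine principle can be illustrated with a minimal, library-independent sketch in plain Python. It is not how Spark Expectations is implemented: the rule names, the record fields, and the `split_records` helper are hypothetical and exist only to show the idea of routing failing records to an error set together with metadata on the rules they failed.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]


def split_records(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Record], Dict[str, int]]:
    """Split records into valid and quarantined sets and count both."""
    valid: List[Record] = []
    errors: List[Record] = []
    for record in records:
        # Collect the names of every rule this record fails.
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            # Keep the original record and attach the failed-rule metadata.
            errors.append({**record, "failed_rules": failed})
        else:
            valid.append(record)
    stats = {
        "input_count": len(records),
        "output_count": len(valid),
        "error_count": len(errors),
    }
    return valid, errors, stats


# Hypothetical rules and sample data, for illustration only.
rules = {
    "order_id_not_null": lambda r: r.get("order_id") is not None,
    "quantity_positive": lambda r: isinstance(r.get("quantity"), int) and r["quantity"] > 0,
}
orders = [
    {"order_id": 1, "quantity": 3},
    {"order_id": None, "quantity": 2},
    {"order_id": 3, "quantity": 0},
]
valid, errors, stats = split_records(orders, rules)
```

In this sketch only the first order reaches `valid`; the other two land in `errors`, each carrying the names of the rules it failed, while `stats` holds the per-run counts.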

---
# Features Of Spark Expectations

The spark-expectations flow and feature diagrams are shown below.

<p align="center">
<img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/flow.png?raw=true width=1000></p>

<p align="center">
<img src=https://github.com/Nike-Inc/spark-expectations/blob/main/docs/se_diagrams/features.png?raw=true width=1000></p>


# Spark Expectations Setup

### Configurations

To set the global configuration for Spark Expectations, define a dictionary variable and fill in all the 
required fields, as in the example below.

```python
from spark_expectations.config.user_config import Constants as user_config

se_user_conf = {
    user_config.se_notifications_enable_email: False,
    user_config.se_notifications_email_smtp_host: "mailhost.nike.com",
    user_config.se_notifications_email_smtp_port: 25,
    user_config.se_notifications_email_from: "<sender_email_id>",
    user_config.se_notifications_email_to_other_nike_mail_id: "<receiver_email_id's>",
    user_config.se_notifications_email_subject: "spark expectations - data quality - notifications", 
    user_config.se_notifications_enable_slack: True,
    user_config.se_notifications_slack_webhook_url: "<slack-webhook-url>", 
    user_config.se_notifications_on_start: True, 
    user_config.se_notifications_on_completion: True,
    user_config.se_notifications_on_fail: True,
    user_config.se_notifications_on_error_drop_exceeds_threshold_breach: True, 
    user_config.se_notifications_on_error_drop_threshold: 15,
    # Optional: enable the two parameters below to capture detailed stats
    # in the <stats_table_name>_detailed table.
    # user_config.enable_query_dq_detailed_result: True,
    # user_config.enable_agg_dq_detailed_result: True,
}
```
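
As a rough illustration of the threshold setting, the sketch below assumes that `se_notifications_on_error_drop_threshold` is a percentage of input records dropped as errors. That interpretation and the `exceeds_error_drop_threshold` helper are assumptions made for this example; they are not taken from the library's code.

```python
def exceeds_error_drop_threshold(
    input_count: int, error_count: int, threshold_pct: float
) -> bool:
    """Return True when the share of dropped records is above the threshold."""
    if input_count <= 0:
        # Nothing was processed, so there is no drop to report.
        return False
    error_pct = 100.0 * error_count / input_count
    return error_pct > threshold_pct


# Assumed semantics: with a threshold of 15, dropping 20 of 100 records
# counts as a breach, while dropping 10 of 100 does not.
breach = exceeds_error_drop_threshold(100, 20, 15)
no_breach = exceeds_error_drop_threshold(100, 10, 15)
```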

### Spark Expectations Initialization 

All the examples below require the following imports and `SparkExpectations` class instantiation.

1. Instantiate the `SparkExpectations` class, which has all the required functions for running data quality rules.

```python
from spark_expectations.core.expectations import SparkExpectations, WrappedDataFrameWriter
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder.getOrCreate()
writer = WrappedDataFrameWriter().mode("append").format("delta")
# writer = WrappedDataFrameWriter().mode("append").format("iceberg")
# product_id should match with the "product_id" in the rules table
se: SparkExpectations = SparkExpectations(
    product_id="your_product",
    rules_df=spark.table("dq_spark_local.dq_rules"),
    stats_table="dq_spark_local.dq_stats",
    stats_table_writer=writer,
    target_and_error_table_writer=writer,
    debugger=False,
    # stats_streaming_options={user_config.se_enable_streaming: False},
)
```

2. Decorate the function with the `@se.with_expectations` decorator.

```python
from spark_expectations.config.user_config import Constants as user_config
from pyspark.sql import DataFrame
import os


@se.with_expectations(
    target_table="dq_spark_local.customer_order",
    write_to_table=True,
    user_conf=se_user_conf,
    target_table_view="order",
)
def build_new() -> DataFrame:
    # Return the dataframe on which Spark-Expectations needs to be run
    _df_order: DataFrame = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv(os.path.join(os.path.dirname(__file__), "resources/order.csv"))
    )
    _df_order.createOrReplaceTempView("order")

    return _df_order 
```
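
For readers unfamiliar with the pattern, the following plain-Python sketch shows how a parameterized decorator can run a function and then post-process what it returns. It only illustrates the general decorator mechanism; `with_checks`, `build_rows`, and the sample rule are hypothetical, and this is not the implementation of `@se.with_expectations`.

```python
import functools
from typing import Callable, Dict, List

Row = Dict[str, int]


def with_checks(rule: Callable[[Row], bool]) -> Callable:
    """Hypothetical decorator: filter the rows a function returns by `rule`."""

    def decorator(func: Callable[..., List[Row]]) -> Callable[..., List[Row]]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> List[Row]:
            rows = func(*args, **kwargs)  # run the user's function first
            return [row for row in rows if rule(row)]  # keep only passing rows

        return wrapper

    return decorator


@with_checks(lambda row: row["quantity"] > 0)
def build_rows() -> List[Row]:
    # Stand-in for a function that builds the data to be checked.
    return [{"order_id": 1, "quantity": 3}, {"order_id": 2, "quantity": 0}]


# Calling the decorated function triggers the check on its return value.
clean_rows = build_rows()
```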


            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "spark-expectations",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": "Ashok Singamaneni",
    "author_email": "ashok.singamaneni@nike.com",
    "download_url": "https://files.pythonhosted.org/packages/ba/d4/0c39b58b78b9d416e3b03bf50efbdb91244379bcfb1db9f722d00d5d0ec9/spark_expectations-2.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "This project helps us to run Data Quality Rules in flight while spark job is being run",
    "version": "2.2.1",
    "project_urls": null,
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "58e7c43b25d71cfe3faf4d0c94adfee5c2ed034fa1ebfc714c8eb04f3b0e81be",
                "md5": "878feac7a0553814c876f8f47aba3d28",
                "sha256": "ca551d21fcfd1452897ecda5acce150b4f70d0ce3412375eaf40fa3116305517"
            },
            "downloads": -1,
            "filename": "spark_expectations-2.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "878feac7a0553814c876f8f47aba3d28",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.9",
            "size": 1022944,
            "upload_time": "2024-11-11T18:11:07",
            "upload_time_iso_8601": "2024-11-11T18:11:07.783429Z",
            "url": "https://files.pythonhosted.org/packages/58/e7/c43b25d71cfe3faf4d0c94adfee5c2ed034fa1ebfc714c8eb04f3b0e81be/spark_expectations-2.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bad40c39b58b78b9d416e3b03bf50efbdb91244379bcfb1db9f722d00d5d0ec9",
                "md5": "e5448f2540f59f0ad9a111bc7488a86b",
                "sha256": "1618faf22dc7012f338a0932f967c9b7a411536263b2729d0f25c8f7c32bd6b7"
            },
            "downloads": -1,
            "filename": "spark_expectations-2.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "e5448f2540f59f0ad9a111bc7488a86b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.9",
            "size": 995195,
            "upload_time": "2024-11-11T18:11:09",
            "upload_time_iso_8601": "2024-11-11T18:11:09.772199Z",
            "url": "https://files.pythonhosted.org/packages/ba/d4/0c39b58b78b9d416e3b03bf50efbdb91244379bcfb1db9f722d00d5d0ec9/spark_expectations-2.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-11-11 18:11:09",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "spark-expectations"
}
        