# PyDeequ
PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Coverage](https://img.shields.io/badge/coverage-90%25-green)
Deequ has four main components:
- Metrics Computation:
  - `Profiles` leverage Analyzers to analyze each column of a dataset.
  - `Analyzers` serve as the foundational module that computes metrics for data profiling and validation at scale.
- Constraint Suggestion:
  - Specify rules for various groups of Analyzers to be run over a dataset, returning a collection of suggested constraints to run in a Verification Suite.
- Constraint Verification:
  - Perform data validation on a dataset with respect to various constraints set by you.
- Metrics Repository:
  - Allows for persistence and tracking of Deequ runs over time.
![](imgs/pydeequ_architecture.jpg)
## 🎉 Announcements 🎉
- **NEW!!!** The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many upgrades, including support up to Spark 3.3.0! Feedback is welcome through GitHub issues.
- With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you set the environment variable `SPARK_VERSION` to specify your Spark version!
- We've released a blog post on integrating PyDeequ with AWS services such as AWS Glue, Athena, and SageMaker! Check it out: [Monitor data quality in your data lake using PyDeequ and AWS Glue](https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/).
- Check out the [PyDeequ Release Announcement Blogpost](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/) with a tutorial walking through the Amazon Reviews dataset!
- Join the PyDeequ community on [PyDeequ Slack](https://join.slack.com/t/pydeequ/shared_invite/zt-te6bntpu-yaqPy7bhiN8Lu0NxpZs47Q) to chat with the devs!
## Quickstart
The following will get you started with some basic usage. For more in-depth examples, take a look at the [`tutorials/`](tutorials/) directory for executable Jupyter notebooks of each module. For documentation on supported interfaces, view the [`documentation`](https://pydeequ.readthedocs.io/).
### Installation
You can install [PyDeequ via pip](https://pypi.org/project/pydeequ/).
```bash
pip install pydeequ
```
### Set up a PySpark session
```python
from pyspark.sql import SparkSession, Row
import pydeequ
spark = (SparkSession
.builder
.config("spark.jars.packages", pydeequ.deequ_maven_coord)
.config("spark.jars.excludes", pydeequ.f2j_maven_coord)
.getOrCreate())
df = spark.sparkContext.parallelize([
Row(a="foo", b=1, c=5),
Row(a="bar", b=2, c=6),
Row(a="baz", b=3, c=None)]).toDF()
```
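PyDeequ resolves the Deequ Maven coordinates from the `SPARK_VERSION` environment variable (see the announcement above). If you haven't exported it in your shell, a minimal sketch is to set it from Python before the `import pydeequ` line; the value `"3.3"` below is only an example and should match your installed Spark version:

```python
import os

# Assumption: SPARK_VERSION must match your installed Spark version;
# "3.3" is just an example value.
os.environ.setdefault("SPARK_VERSION", "3.3")

import pydeequ
```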
### Analyzers
```python
from pydeequ.analyzers import *
analysisResult = AnalysisRunner(spark) \
.onData(df) \
.addAnalyzer(Size()) \
.addAnalyzer(Completeness("b")) \
.run()
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```
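Multiple analyzers can be chained onto the same runner to compute several metrics in one pass. A sketch, using `Mean` and `ApproxCountDistinct`, which also come from `pydeequ.analyzers`:

```python
# Sketch: compute several metrics over the same DataFrame in one run.
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .addAnalyzer(Mean("c")) \
    .addAnalyzer(ApproxCountDistinct("a")) \
    .run()

AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()
```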
### Profile
```python
from pydeequ.profiles import *
result = ColumnProfilerRunner(spark) \
.onData(df) \
.run()
for col, profile in result.profiles.items():
print(profile)
```
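Each profile also exposes individual fields. A sketch, assuming the attribute names `completeness`, `approximateNumDistinctValues`, and `dataType` (they mirror Deequ's column profiles; check `pydeequ.profiles` if they differ):

```python
for col, profile in result.profiles.items():
    # Assumed attribute names -- see pydeequ.profiles for the full profile API.
    print(f"{col}: completeness={profile.completeness}, "
          f"approx. distinct={profile.approximateNumDistinctValues}, "
          f"type={profile.dataType}")
```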
### Constraint Suggestions
```python
from pydeequ.suggestions import *
suggestionResult = ConstraintSuggestionRunner(spark) \
.onData(df) \
.addConstraintRule(DEFAULT()) \
.run()
# Constraint Suggestions in JSON format
print(suggestionResult)
```
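`print(suggestionResult)` dumps everything on one line. Since the result appears to be a plain JSON-serializable dict, one way to make it readable is to pretty-print it — a sketch:

```python
import json

# Pretty-print the suggestions (assumes suggestionResult is a JSON-serializable dict).
print(json.dumps(suggestionResult, indent=2))
```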
### Constraint Verification
```python
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Review Check")
checkResult = VerificationSuite(spark) \
.onData(df) \
.addCheck(
check.hasSize(lambda x: x >= 3) \
.hasMin("b", lambda x: x == 0) \
.isComplete("c") \
.isUnique("a") \
.isContainedIn("a", ["foo", "bar", "baz"]) \
.isNonNegative("b")) \
.run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```
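To surface only the constraints that did not pass, you can filter the result DataFrame. A sketch, assuming Deequ's usual `constraint_status` column (with values such as `Success` and `Failure`):

```python
# Keep only constraints whose status is not "Success".
checkResult_df.filter(checkResult_df.constraint_status != "Success") \
    .show(truncate=False)
```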
### Repository
Save to a Metrics Repository by adding the `useRepository()` and `saveOrAppendResult()` calls to your Analysis Runner.
```python
from pydeequ.repository import *
from pydeequ.analyzers import *
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)
analysisResult = AnalysisRunner(spark) \
.onData(df) \
.addAnalyzer(ApproxCountDistinct('b')) \
.useRepository(repository) \
.saveOrAppendResult(resultKey) \
.run()
```
To load previous runs, use the `repository` object to read the stored results back in.
```python
result_metrep_df = repository.load() \
.before(ResultKey.current_milli_time()) \
.forAnalyzers([ApproxCountDistinct('b')]) \
.getSuccessMetricsAsDataFrame()
```
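The loaded metrics come back as an ordinary Spark DataFrame, so the usual DataFrame operations apply:

```python
# Inspect the metrics loaded from the repository.
result_metrep_df.show()
```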
### Wrapping up
After you've run your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.
```python
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
```
## [Contributing](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md)
Please refer to the [contributing doc](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md) for how to contribute to PyDeequ.
## [License](https://github.com/awslabs/python-deequ/blob/master/LICENSE)
This library is licensed under the Apache 2.0 License.
******
## Contributing Developer Setup
1. Setup [SDKMAN](#setup-sdkman)
1. Setup [Java](#setup-java)
1. Setup [Apache Spark](#setup-apache-spark)
1. Install [Poetry](#poetry)
1. Run [tests locally](#running-tests-locally)
### Setup SDKMAN
SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based
system. It provides a convenient command-line interface for installing, switching, removing, and listing
candidates. SDKMAN! installs smoothly on macOS, Linux, WSL, Cygwin, and more, and supports the Bash and Zsh shells. See the
documentation on the [SDKMAN! website](https://sdkman.io).
Open your favourite terminal and enter the following:
```bash
$ curl -s https://get.sdkman.io | bash

# If the environment needs tweaking for SDKMAN to be installed,
# the installer will prompt you accordingly and ask you to restart.

# Next, open a new terminal or enter:
$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Lastly, run the following to ensure that the installation succeeded:
$ sdk version
```
### Setup Java
Install Java. Open your favourite terminal and enter the following:
```bash
# List the AdoptOpenJDK OpenJDK versions:
$ sdk list java

# To install Java 11:
$ sdk install java 11.0.10.hs-adpt

# To install Java 8:
$ sdk install java 8.0.292.hs-adpt
```
### Setup Apache Spark
Install Apache Spark. Open your favourite terminal and enter the following:
```bash
# List the Apache Spark versions:
$ sdk list spark

# To install Spark 3:
$ sdk install spark 3.0.2
```
### Poetry
Install and manage the project's dependencies with Poetry; see the Poetry [CLI documentation](https://python-poetry.org/docs/cli/#search) for the full command reference.
```bash
poetry install
poetry update
# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o
```
## Running Tests Locally
Take a look at the tests in `tests/dataquality` and `tests/jobs`:
```bash
$ poetry run pytest
```
## Running Tests Locally (Docker)
If you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.
```bash
docker build . -t spark-3.3-docker-test
docker run spark-3.3-docker-test
```