# PyDeequ
PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Coverage](https://img.shields.io/badge/coverage-90%25-green)
Deequ has four main components:
- Metrics Computation:
  - `Profiles` leverage Analyzers to analyze each column of a dataset.
  - `Analyzers` serve as the foundational module that computes metrics for data profiling and validation at scale.
- Constraint Suggestion:
  - Specify rules for various groups of Analyzers to be run over a dataset, returning a collection of suggested constraints to run in a Verification Suite.
- Constraint Verification:
  - Perform data validation on a dataset with respect to various constraints set by you.
- Metrics Repository:
  - Allows for persistence and tracking of Deequ runs over time.
![](imgs/pydeequ_architecture.jpg)
## 🎉 Announcements 🎉
- **NEW!!!** The 1.1.0 release of Python Deequ has been published to PyPI: https://pypi.org/project/pydeequ/. This release brings many upgrades, including support up to Spark 3.3.0! Feedback is welcome through GitHub issues.
- With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you set the environment variable `SPARK_VERSION` to specify your Spark version!
- We've released a blog post on integrating PyDeequ with AWS services such as AWS Glue, Athena, and SageMaker! Check it out: [Monitor data quality in your data lake using PyDeequ and AWS Glue](https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/).
- Check out the [PyDeequ Release Announcement Blogpost](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/) with a tutorial walking through the Amazon Reviews dataset!
- Join the PyDeequ community on [PyDeequ Slack](https://join.slack.com/t/pydeequ/shared_invite/zt-te6bntpu-yaqPy7bhiN8Lu0NxpZs47Q) to chat with the devs!
## Quickstart
The following will get you started with some basic usage. For more in-depth examples, take a look at the [`tutorials/`](tutorials/) directory for executable Jupyter notebooks of each module. For documentation on supported interfaces, view the [`documentation`](https://pydeequ.readthedocs.io/).
### Installation
You can install [PyDeequ via pip](https://pypi.org/project/pydeequ/).
```bash
pip install pydeequ
```
### Set up a PySpark session
```python
from pyspark.sql import SparkSession, Row
import pydeequ
spark = (SparkSession
.builder
.config("spark.jars.packages", pydeequ.deequ_maven_coord)
.config("spark.jars.excludes", pydeequ.f2j_maven_coord)
.getOrCreate())
df = spark.sparkContext.parallelize([
Row(a="foo", b=1, c=5),
Row(a="bar", b=2, c=6),
Row(a="baz", b=3, c=None)]).toDF()
```
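PyDeequ resolves the Deequ Maven coordinates from the `SPARK_VERSION` environment variable (see the announcement above). If you haven't exported it in your shell, a minimal sketch is to set it from Python before the `import pydeequ` line; the value `"3.3"` below is only an example and should match your installed Spark version:

```python
import os

# Assumption: SPARK_VERSION must match your installed Spark version;
# "3.3" is just an example value.
os.environ.setdefault("SPARK_VERSION", "3.3")

import pydeequ
```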
### Analyzers
```python
from pydeequ.analyzers import *
analysisResult = AnalysisRunner(spark) \
.onData(df) \
.addAnalyzer(Size()) \
.addAnalyzer(Completeness("b")) \
.run()
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```
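Multiple analyzers can be chained onto the same runner to compute several metrics in one pass. A sketch, using `Mean` and `ApproxCountDistinct`, which also come from `pydeequ.analyzers`:

```python
# Sketch: compute several metrics over the same DataFrame in one run.
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .addAnalyzer(Mean("c")) \
    .addAnalyzer(ApproxCountDistinct("a")) \
    .run()

AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()
```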
### Profile
```python
from pydeequ.profiles import *
result = ColumnProfilerRunner(spark) \
.onData(df) \
.run()
for col, profile in result.profiles.items():
print(profile)
```
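Each profile also exposes individual fields. A sketch, assuming the attribute names `completeness`, `approximateNumDistinctValues`, and `dataType` (they mirror Deequ's column profiles; check `pydeequ.profiles` if they differ):

```python
for col, profile in result.profiles.items():
    # Assumed attribute names -- see pydeequ.profiles for the full profile API.
    print(f"{col}: completeness={profile.completeness}, "
          f"approx. distinct={profile.approximateNumDistinctValues}, "
          f"type={profile.dataType}")
```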
### Constraint Suggestions
```python
from pydeequ.suggestions import *
suggestionResult = ConstraintSuggestionRunner(spark) \
.onData(df) \
.addConstraintRule(DEFAULT()) \
.run()
# Constraint Suggestions in JSON format
print(suggestionResult)
```
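`print(suggestionResult)` dumps everything on one line. Since the result appears to be a plain JSON-serializable dict, one way to make it readable is to pretty-print it — a sketch:

```python
import json

# Pretty-print the suggestions (assumes suggestionResult is a JSON-serializable dict).
print(json.dumps(suggestionResult, indent=2))
```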
### Constraint Verification
```python
from pydeequ.checks import *
from pydeequ.verification import *
check = Check(spark, CheckLevel.Warning, "Review Check")
checkResult = VerificationSuite(spark) \
.onData(df) \
.addCheck(
check.hasSize(lambda x: x >= 3) \
.hasMin("b", lambda x: x == 0) \
.isComplete("c") \
.isUnique("a") \
.isContainedIn("a", ["foo", "bar", "baz"]) \
.isNonNegative("b")) \
.run()
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```
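To surface only the constraints that did not pass, you can filter the result DataFrame. A sketch, assuming Deequ's usual `constraint_status` column (with values such as `Success` and `Failure`):

```python
# Keep only constraints whose status is not "Success".
checkResult_df.filter(checkResult_df.constraint_status != "Success") \
    .show(truncate=False)
```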
### Repository
Save to a Metrics Repository by adding the `useRepository()` and `saveOrAppendResult()` calls to your Analysis Runner.
```python
from pydeequ.repository import *
from pydeequ.analyzers import *
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
key_tags = {'tag': 'pydeequ hello world'}
resultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)
analysisResult = AnalysisRunner(spark) \
.onData(df) \
.addAnalyzer(ApproxCountDistinct('b')) \
.useRepository(repository) \
.saveOrAppendResult(resultKey) \
.run()
```
To load previous runs, use the `repository` object to read the stored results back in.
```python
result_metrep_df = repository.load() \
.before(ResultKey.current_milli_time()) \
.forAnalyzers([ApproxCountDistinct('b')]) \
.getSuccessMetricsAsDataFrame()
```
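The loaded metrics come back as an ordinary Spark DataFrame, so the usual DataFrame operations apply:

```python
# Inspect the metrics loaded from the repository.
result_metrep_df.show()
```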
### Wrapping up
After you've run your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.
```python
spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()
```
## [Contributing](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md)
Please refer to the [contributing doc](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md) for how to contribute to PyDeequ.
## [License](https://github.com/awslabs/python-deequ/blob/master/LICENSE)
This library is licensed under the Apache 2.0 License.
******
## Contributing Developer Setup
1. Setup [SDKMAN](#setup-sdkman)
1. Setup [Java](#setup-java)
1. Setup [Apache Spark](#setup-apache-spark)
1. Install [Poetry](#poetry)
1. Run [tests locally](#running-tests-locally)
### Setup SDKMAN
SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based
system. It provides a convenient command-line interface for installing, switching, removing, and listing
candidates. SDKMAN! installs smoothly on macOS, Linux, WSL, Cygwin, and more, and supports the Bash and Zsh shells. See the
documentation on the [SDKMAN! website](https://sdkman.io).
Open your favourite terminal and enter the following:
```bash
$ curl -s https://get.sdkman.io | bash

# If the environment needs tweaking for SDKMAN to be installed,
# the installer will prompt you accordingly and ask you to restart.

# Next, open a new terminal or enter:
$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Lastly, run the following to ensure that the installation succeeded:
$ sdk version
```
### Setup Java
Install Java. Open your favourite terminal and enter the following:
```bash
# List the AdoptOpenJDK OpenJDK versions:
$ sdk list java

# To install Java 11:
$ sdk install java 11.0.10.hs-adpt

# To install Java 8:
$ sdk install java 8.0.292.hs-adpt
```
### Setup Apache Spark
Install Apache Spark. Open your favourite terminal and enter the following:
```bash
# List the Apache Spark versions:
$ sdk list spark

# To install Spark 3:
$ sdk install spark 3.0.2
```
### Poetry
Install and manage the project's dependencies with Poetry; see the Poetry [CLI documentation](https://python-poetry.org/docs/cli/#search) for the full command reference.
```bash
poetry install
poetry update
# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o
```
## Running Tests Locally
Take a look at the tests in `tests/dataquality` and `tests/jobs`:
```bash
$ poetry run pytest
```
## Running Tests Locally (Docker)
If you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.
```bash
docker build . -t spark-3.3-docker-test
docker run spark-3.3-docker-test
```