data-quality-validation-pydeequ


Name: data-quality-validation-pydeequ
Version: 1.2
Summary: A library for data quality validation using PyDeequ and to send email notification.
Upload time: 2024-05-13 10:28:50
Author: Ketan Kirange
Requires Python: >=3.6
License: None
Keywords: data quality validation pydeequ
Requirements: No requirements were recorded.

<span style='color: Pink; font-size:25px'> **Data Quality Validation** </span>

This package performs data quality validation using PyDeequ.  
It lets users validate their data and identify issues that may affect its suitability for processing or analysis.  
It can also send an email notification with the validation result.

**Author**: Ketan Kirange

**Contributors**: Ketan Kirange, Ajay Rahul Raja

This package contains tools and utilities for performing data quality checks on data in the following formats, leveraging libraries such as PyDeequ and SODA utilities:
 - Pandas
 - Dask
 - PySpark

These checks help ensure the integrity, accuracy, and completeness of the data, which is essential for robust data-driven decision-making.

<span style='color: Pink; font-size:25px'> **Importance of Data Quality** </span>

Data quality plays a pivotal role in any engineering project, especially in data science, reporting, and analysis.  

Here's why ensuring high data quality is crucial:

<span style='color: Pink; font-size:25px'> 1. Reliable Insights </span>

High-quality data leads to reliable and trustworthy insights.  
When the data is accurate, complete, and consistent, data scientists and analysts can make informed decisions confidently.

<span style='color: Pink; font-size:25px'> 2. Trustworthy Models </span>

Data quality directly impacts the performance and reliability of machine learning models.  
Models trained on low-quality data may produce biased or inaccurate predictions, leading to unreliable outcomes.

<span style='color: Pink; font-size:25px'> 3. Effective Reporting </span>

Quality data is fundamental for generating accurate reports and visualizations.  
Analysts and stakeholders rely on these reports for understanding trends, identifying patterns, and making strategic decisions.  
Poor data quality can lead to misleading reports and flawed interpretations.

<span style='color: Pink; font-size:25px'> 4. Regulatory Compliance </span>

In many industries, compliance with regulations such as GDPR, HIPAA, or industry-specific standards is mandatory.  
Ensuring data quality is essential for meeting these regulatory requirements and avoiding potential legal consequences.

<span style='color: Pink; font-size:25px'> **Data Quality Validation Tools** </span>

This repository provides a set of tools and utilities to perform comprehensive data quality validation on various data formats:

- **Pandas**: Data quality checks for data stored in Pandas DataFrames, including checks for missing values, data types, and statistical summaries.
- **Dask**: Scalable data quality checks for large-scale datasets using Dask, ensuring consistency and accuracy across distributed computing environments.
- **PySpark with PyDeequ**: Integration with PyDeequ, enabling data quality validation on data processed using PySpark, including checks for schema validation, data distribution, and anomaly detection.
- **SODA Utilities**: Utilities for validating data quality using the SODA (Scalable Observations of Data Attributes) framework, allowing for automated quality checks and anomaly detection.
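
To make the kinds of checks listed above more concrete, here is a minimal sketch in plain Python of two common data quality metrics: completeness (share of non-null values) and uniqueness. The helper names `completeness` and `is_unique` are hypothetical and chosen only for illustration; they are not part of this package or of PyDeequ, which computes comparable metrics on Spark DataFrames.

```python
# Illustrative only: plain-Python versions of two common data quality metrics.
# These helper names are hypothetical and are not part of this package or PyDeequ.


def completeness(rows, column):
    """Return the fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def is_unique(rows, column):
    """Return True when no non-null value of `column` appears more than once."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) == len(set(values))


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 3, "email": "d@example.com"},
]

print(completeness(rows, "email"))  # 3 of the 4 rows have an email
print(is_unique(rows, "id"))        # id 3 appears twice
```

A validation run typically compares metrics like these against thresholds and reports which columns fail.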

<span style='color: Pink; font-size:25px'> **Getting Started** </span>

<span style='color: Pink; font-size:25px'> **Contributing** </span>

We welcome contributions from the community to enhance and expand the capabilities of this data quality validation repository.  
Please refer to the [contribution guidelines](link-to-contribution-guidelines) for more information on how to contribute.




<br></br>
<span style="font-size:13pt; color:orange">**Prerequisites:**</span>

- Step 1: Download Java, Python, and Apache Spark.    
Having the appropriate versions is essential to run the code on a local system.  

<span style="font-size:11pt; color:green"> **Java:** </span>     [Java 1.8 Archive Downloads](https://www.oracle.com/uk/java/technologies/javase/javase8-archive-downloads.html)

<span style="font-size:11pt; color:green"> **Python:** </span> [Python 3.9.18 Release](https://www.python.org/downloads/release/python-390/)

<span style="font-size:11pt; color:green"> **Apache Spark:** </span> [Apache Spark 3.3.0 Release](https://spark.apache.org/releases/spark-release-3-3-0.html)
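
Before moving on, it can help to confirm that the prerequisites are visible from your environment. The following is a minimal sketch using only the Python standard library. It assumes that Java and Spark expose the `java` and `spark-submit` executables on your `PATH`, which may not hold for every installation, and the function name `check_prerequisites` is hypothetical. The minimum Python version of 3.6 comes from the package metadata.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 6), commands=("java", "spark-submit")):
    """Return a dict mapping each prerequisite name to True or False."""
    # The running interpreter must be at least `min_python`.
    results = {"python": sys.version_info[:2] >= tuple(min_python)}
    for command in commands:
        # shutil.which returns None when the executable is not on PATH.
        results[command] = shutil.which(command) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'not ok'}")
```

This only checks that the tools can be found; it does not verify that the installed Java or Spark versions match the ones linked above.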

- Step 2: Install PyDeequ from the terminal if you see an error such as "PyDeequ module is not installed on the machine."

<span style="font-size:13pt; color:orange"> **How to install PyDeequ? Use the following command:** </span>  
  `pip install pydeequ`

- Step 3: Install our ‘Data Quality Validation’ Python library from the terminal.  
  `pip install data-quality-validation-pydeequ`

- Step 4: To run the Data Quality Validation function, import the library as below:  
  `from dqv.dqv_pydeequ import DqvPydeequ, sendEmailNotification`  

- Step 5: Create a config file in a folder, listing the columns that need to be validated.  
  You can name the file as you wish, but pass the same name to the DqvPydeequ function.

- Step 6: Upload your data to S3 or, if you are running locally, save it in a new directory.

- Step 7: Pass the config file and your source and target file paths to the DqvPydeequ function.

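   ```
   from dqv.dqv_pydeequ import DqvPydeequ  # import from Step 4

   DqvPydeequ(
       "",  # config_file: path to the config file from Step 5
       "",  # source_data_path
       "",  # target_data_path
   )
   ```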

- Step 8: Run the file to validate.

- Step 9: After validating the data, the result can be sent via email.  
  For this, import the library as below:  
  `from email_notification import sendEmailNotification`

- Step 10: Save your AWS credentials, sender and receiver emails, and region as a dictionary.
  Then pass the source path, target path, and this dictionary to the sendEmailNotification function.

   ```
   email_config = {
       "aws_access_key_id": "",      # AWS credentials
       "aws_secret_access_key": "",  # AWS credentials
       "aws_session_token": "",      # AWS credentials
       "sender_email": "",           # sender email address
       "receiver_email": "",         # receiver email address
       "aws_region": "",             # AWS region
   }
   ```

   ```
   send_notification = sendEmailNotification(
       "",            # source_data_path
       "",            # target_data_path
       email_config,  # dictionary defined above
   )
   ```
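
Hard-coding AWS credentials in a script makes them easy to leak. As one possible alternative, here is a minimal sketch that builds the same `email_config` dictionary from environment variables. `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, and `AWS_DEFAULT_REGION` are the standard AWS variable names; `DQV_SENDER_EMAIL` and `DQV_RECEIVER_EMAIL`, along with the helper names `build_email_config` and `missing_keys`, are hypothetical and not part of this package.

```python
import os


def build_email_config(env=None):
    """Build the email_config dictionary from environment variables."""
    env = os.environ if env is None else env
    return {
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID", ""),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY", ""),
        "aws_session_token": env.get("AWS_SESSION_TOKEN", ""),
        # DQV_SENDER_EMAIL and DQV_RECEIVER_EMAIL are hypothetical names
        # chosen for this sketch; use whatever fits your setup.
        "sender_email": env.get("DQV_SENDER_EMAIL", ""),
        "receiver_email": env.get("DQV_RECEIVER_EMAIL", ""),
        "aws_region": env.get("AWS_DEFAULT_REGION", ""),
    }


def missing_keys(config):
    """Return the keys whose values are empty, so they can be reported early."""
    return [key for key, value in config.items() if not value]


email_config = build_email_config()
print("Unset values:", missing_keys(email_config))
```

The resulting `email_config` can then be passed as the third argument to `sendEmailNotification`, exactly as in Step 10.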


<br></br>
Refer to this repo for the structure of the config file:  <br></br>
Git: <a href="https://github.com/dataruk/data-quality-validation">https://github.com/dataruk/data-quality-validation</a>

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "data-quality-validation-pydeequ",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "data quality validation pydeequ",
    "author": "Ketan Kirange",
    "author_email": "k.kirange@reply.com",
    "download_url": "https://files.pythonhosted.org/packages/ee/3d/beca35e170ff61fce9d9928c62b44f526518d6d01442d42e6f8cec75f162/data-quality-validation-pydeequ-1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A library for data quality validation using PyDeequ and to send email notification.",
    "version": "1.2",
    "project_urls": null,
    "split_keywords": [
        "data",
        "quality",
        "validation",
        "pydeequ"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "e307189965ae6f405017a314b0b8dbeccacd650b81d173d406f85db36a50677f",
                "md5": "da01649e2f973bd48ca962dbfe28ec36",
                "sha256": "a85a2dcf9bf3977075b6b78a4c414e36a292f06b8a18287662cde8ebe5a39c93"
            },
            "downloads": -1,
            "filename": "data_quality_validation_pydeequ-1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da01649e2f973bd48ca962dbfe28ec36",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 11039,
            "upload_time": "2024-05-13T10:29:05",
            "upload_time_iso_8601": "2024-05-13T10:29:05.054122Z",
            "url": "https://files.pythonhosted.org/packages/e3/07/189965ae6f405017a314b0b8dbeccacd650b81d173d406f85db36a50677f/data_quality_validation_pydeequ-1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ee3dbeca35e170ff61fce9d9928c62b44f526518d6d01442d42e6f8cec75f162",
                "md5": "0d24625371aba85ee90d703936ffe66f",
                "sha256": "057b88198d65a255643fbe268e7c4b200b7e8f09c5d8da7bdc36142f74fe1aab"
            },
            "downloads": -1,
            "filename": "data-quality-validation-pydeequ-1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "0d24625371aba85ee90d703936ffe66f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 8336,
            "upload_time": "2024-05-13T10:28:50",
            "upload_time_iso_8601": "2024-05-13T10:28:50.482410Z",
            "url": "https://files.pythonhosted.org/packages/ee/3d/beca35e170ff61fce9d9928c62b44f526518d6d01442d42e6f8cec75f162/data-quality-validation-pydeequ-1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-05-13 10:28:50",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "data-quality-validation-pydeequ"
}
        