[![Version](https://img.shields.io/pypi/v/data-validation-framework)](https://github.com/BlueBrain/data-validation-framework/releases)
[![Build status](https://github.com/BlueBrain/data-validation-framework/actions/workflows/run-tox.yml/badge.svg?branch=main)](https://github.com/BlueBrain/data-validation-framework/actions)
[![Coverage](https://codecov.io/github/BlueBrain/data-validation-framework/coverage.svg?branch=main)](https://codecov.io/github/BlueBrain/data-validation-framework?branch=main)
[![License](https://img.shields.io/badge/License-Apache%202-blue)](https://github.com/BlueBrain/data-validation-framework/blob/main/LICENSE.txt)
[![Documentation status](https://readthedocs.org/projects/data-validation-framework/badge/?version=latest)](https://data-validation-framework.readthedocs.io/)
# Data Validation Framework
This project provides simple tools to create data validation workflows.
The workflows are based on the [luigi](https://luigi.readthedocs.io/en/stable) library.
The main objective of this framework is to keep in one place both the specifications that the
data must follow and the code that actually tests the data. This avoids having multiple documents
to store the specifications and a separate repository to store the code.
## Installation
This package should be installed using pip:
```bash
pip install data-validation-framework
```
## Usage
### Building a workflow
Building a new workflow is simple, as you can see in the following example:
```python
import luigi
import data_validation_framework as dvf


class ValidationTask1(dvf.task.ElementValidationTask):
    """Use the class docstring to describe the specifications of the ValidationTask1."""

    output_columns = {"col_name": None}

    @staticmethod
    def validation_function(row, output_path, *args, **kwargs):
        # Return the validation result for one row of the dataset
        if row["col_name"] <= 10:
            return dvf.result.ValidationResult(is_valid=True)
        else:
            return dvf.result.ValidationResult(
                is_valid=False,
                ret_code=1,
                comment="The value should always be <= 10"
            )


def external_validation_function(df, output_path, *args, **kwargs):
    # Update the dataset inplace here by setting values to the 'is_valid' column.
    # The 'ret_code' and 'comment' values are optional, they will be added to the report
    # in order to help the user to understand why the dataset did not pass the validation.

    # We can use the value from kwargs["param_value"] here.
    if int(kwargs["param_value"]) <= 10:
        df["is_valid"] = True
    else:
        df["is_valid"] = False
        df["ret_code"] = 1
        df["comment"] = "The value should always be <= 10"


class ValidationTask2(dvf.task.SetValidationTask):
    """In some cases you might want to keep the docstring to describe what a developer
    needs to know, not the end-user. In this case, you can use the ``__specifications__``
    attribute to store the specifications."""

    a_parameter = luigi.Parameter()

    __specifications__ = """Use the __specifications__ to describe the specifications of the
    ValidationTask2."""

    def inputs(self):
        return {ValidationTask1: {"col_name": "new_col_name_in_current_task"}}

    def kwargs(self):
        return {"param_value": self.a_parameter}

    validation_function = staticmethod(external_validation_function)


class ValidationWorkflow(dvf.task.ValidationWorkflow):
    """Use the global workflow specifications to give general context to the end-user."""

    def inputs(self):
        return {
            ValidationTask1: {},
            ValidationTask2: {},
        }
```
Here, the `ValidationWorkflow` class only defines the sub-tasks that should be called for the
validation. The sub-tasks can be either a `dvf.task.ElementValidationTask` or a
`dvf.task.SetValidationTask`. In both cases, you can define relations between these sub-tasks,
since one may need the result of another to run properly. This is done in two steps:
1. In the required task, an `output_columns` attribute should be defined so that the next tasks
   can know what data is available, as shown in the previous example for `ValidationTask1`.
2. In the task that requires another task, an `inputs` method should be defined, as shown in the
   previous example for `ValidationTask2`.
The sub-classes of `dvf.task.ElementValidationTask` should return a
`dvf.result.ValidationResult` object. The sub-classes of `dvf.task.SetValidationTask` should
return a `pandas.DataFrame` object with at least the following columns
`["is_valid", "ret_code", "comment", "exception"]` and with the same index as the input dataset.
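As an illustration of the in-place update used for set validation, here is a minimal sketch of a validation function that flags individual rows rather than the whole dataset. The `value` column, the `max_value` keyword and the function name are hypothetical; only the `is_valid`, `ret_code` and `comment` columns come from the description above.

```python
import pandas as pd


def check_max_value(df, output_path, *args, **kwargs):
    """Flag each row whose 'value' exceeds kwargs['max_value'] (hypothetical names)."""
    max_value = kwargs.get("max_value", 10)
    valid = df["value"] <= max_value

    # Update the dataset in place, keeping the index of the input dataset.
    df["is_valid"] = valid
    df["ret_code"] = (~valid).astype(int)
    df["comment"] = ""
    df.loc[~valid, "comment"] = f"The value should always be <= {max_value}"


# Standalone check on a small hypothetical dataset, outside of any workflow.
example = pd.DataFrame({"value": [5, 10, 11]}, index=["a", "b", "c"])
check_max_value(example, output_path=None, max_value=10)
print(example)
```

Such a function could then be attached to a `dvf.task.SetValidationTask` with `validation_function = staticmethod(check_max_value)`, in the same way as `external_validation_function` in the example above.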
## Generate the specifications of a workflow
The specifications that the data should follow can be generated with the following luigi command:
```bash
luigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --result-path out --ValidationTask2-a-parameter 15 --specifications-only
```
## Running a workflow
The workflow can be run with the following luigi command (note that the module `test_validation`
must be available in your `sys.path`):
```bash
luigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --dataset-df dataset.csv --result-path out --ValidationTask2-a-parameter 15
```
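The command above expects a `dataset.csv` file. The following sketch builds a small one with the standard library. The `id` column, the sample values and the function name are hypothetical; only the `col_name` column is taken from the `ValidationTask1` example. How the framework interprets the first column is not covered here, so check the documentation for the expected layout.

```python
import csv
import io


def dataset_csv_text(values):
    """Return CSV text with an 'id' column and a 'col_name' column (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "col_name"])
    for row_id, value in enumerate(values):
        writer.writerow([row_id, value])
    return buffer.getvalue()


# Print the CSV text for three sample rows; one of them is above the limit of 10.
print(dataset_csv_text([3, 10, 42]), end="")
```

Writing the returned text to `dataset.csv`, for example with `pathlib.Path("dataset.csv").write_text(...)`, gives a file that can be passed to `--dataset-df`.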
This workflow will generate the following files:
* `out/report_ValidationWorkflow.pdf`: The PDF validation report.
* `out/ValidationTask1/report.csv`: The CSV containing the validity values of the task
  `ValidationTask1`.
* `out/ValidationTask2/report.csv`: The CSV containing the validity values of the task
  `ValidationTask2`.
* `out/ValidationWorkflow/report.csv`: The CSV containing the validity values of the complete
  workflow.
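Once the workflow has run, the CSV reports can be inspected with any CSV reader. Here is a minimal sketch, assuming that a report has `is_valid` and `comment` columns (the columns listed for `dvf.task.SetValidationTask` above) and that booleans are written as `True`/`False`. The function name and the sample rows are hypothetical.

```python
import csv
from collections import Counter


def summarise_report(lines):
    """Count valid and invalid rows and tally the comments of the invalid ones.

    ``lines`` is any iterable of CSV lines, e.g. an open file handle.
    """
    n_valid = 0
    n_invalid = 0
    comments = Counter()
    for row in csv.DictReader(lines):
        # Assumption: booleans are serialised as the strings "True" / "False".
        if row["is_valid"].strip().lower() == "true":
            n_valid += 1
        else:
            n_invalid += 1
            comments[row.get("comment") or "(no comment)"] += 1
    return n_valid, n_invalid, comments


# Hypothetical report content, used here instead of a real file.
sample = [
    ",is_valid,ret_code,comment,exception",
    "0,True,0,,",
    "1,False,1,The value should always be <= 10,",
    "2,False,1,The value should always be <= 10,",
]
print(summarise_report(sample))
```

With a real report, the same function could be called on an open file, for example `with open("out/ValidationTask1/report.csv", newline="") as handle: summarise_report(handle)`.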
Note that, as with any [luigi](https://luigi.readthedocs.io/en/stable) workflow, the values can be
stored in a `luigi.cfg` file instead of being passed to the CLI.
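As a sketch, the options used in the command above might be stored in a `luigi.cfg` file placed in the working directory, as shown below. This assumes that each section is named after the task owning the parameter and that the parameter names match the CLI options with hyphens replaced by underscores; check the luigi configuration documentation for the exact rules.

```ini
; Hypothetical luigi.cfg mirroring the CLI options of the command above
[ValidationWorkflow]
dataset_df = dataset.csv
result_path = out

[ValidationTask2]
a_parameter = 15
```

With such a file, the corresponding options could be dropped from the command line.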
## Advanced features
### Require a regular Luigi task
In some cases, you may want to execute a regular Luigi task in a validation workflow. In this
case, you can use the `extra_requires()` method to declare these extra requirements. In the
validation task, you can then get the targets of these extra requirements using the
`extra_input()` method.
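```python
# Imports assumed from the names used below
import luigi
from data_validation_framework import target, task


class TestTaskA(luigi.Task):

    def run(self):
        # Do something and write the 'target.file'
        # (placeholder content, replace it with the real work of the task)
        with self.output().open("w") as f:
            f.write("placeholder")

    def output(self):
        return target.OutputLocalTarget("target.file")


class TestTaskB(task.SetValidationTask):

    output_columns = {"extra_target_path": None}

    def kwargs(self):
        # Pass the path of the extra requirement's target to the validation function
        return {"extra_task_target_path": self.extra_input().path}

    def extra_requires(self):
        return TestTaskA()

    @staticmethod
    def validation_function(df, output_path, *args, **kwargs):
        df["is_valid"] = True
        df["extra_target_path"] = kwargs["extra_task_target_path"]
```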
## Funding & Acknowledgment
The development of this software was supported by funding to the Blue Brain Project, a research
center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH
Board of the Swiss Federal Institutes of Technology.
For license and authors, see `LICENSE.txt` and `AUTHORS.md` respectively.
Copyright © 2022-2023 Blue Brain Project/EPFL
Raw data
{
"_id": null,
"home_page": "https://data-validation-framework.readthedocs.io",
"name": "data-validation-framework",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": null,
"author": "Blue Brain Project, EPFL",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/23/5c/2713d071758a703b6739a7e508ff9ea7160c6bbf33851a3d4266ab60ed34/data_validation_framework-0.8.0.tar.gz",
"platform": null,
"description": "[![Version](https://img.shields.io/pypi/v/data-validation-framework)](https://github.com/BlueBrain/data-validation-framework/releases)\n[![Build status](https://github.com/BlueBrain/data-validation-framework/actions/workflows/run-tox.yml/badge.svg?branch=main)](https://github.com/BlueBrain/data-validation-framework/actions)\n[![Coverage](https://codecov.io/github/BlueBrain/data-validation-framework/coverage.svg?branch=main)](https://codecov.io/github/BlueBrain/data-validation-framework?branch=main)\n[![License](https://img.shields.io/badge/License-Apache%202-blue)](https://github.com/BlueBrain/data-validation-framework/blob/main/LICENSE.txt)\n[![Documentation status](https://readthedocs.org/projects/data-validation-framework/badge/?version=latest)](https://data-validation-framework.readthedocs.io/)\n\n\n# Data Validation Framework\n\nThis project provides simple tools to create data validation workflows.\nThe workflows are based on the [luigi](https://luigi.readthedocs.io/en/stable) library.\n\nThe main objective of this framework is to gather in a same place both the specifications that the\ndata must follow and the code that actually tests the data. 
This avoids having multiple documents\nto store the specifications and a repository to store the code.\n\n\n## Installation\n\nThis package should be installed using pip:\n\n```bash\npip install data-validation-framework\n```\n\n## Usage\n\n### Building a workflow\n\nBuilding a new workflow is simple, as you can see in the following example:\n\n```python\nimport luigi\nimport data_validation_framework as dvf\n\n\nclass ValidationTask1(dvf.task.ElementValidationTask):\n \"\"\"Use the class docstring to describe the specifications of the ValidationTask1.\"\"\"\n\n output_columns = {\"col_name\": None}\n\n @staticmethod\n def validation_function(row, output_path, *args, **kwargs):\n # Return the validation result for one row of the dataset\n if row[\"col_name\"] <= 10:\n return dvf.result.ValidationResult(is_valid=True)\n else:\n return dvf.result.ValidationResult(\n is_valid=False,\n ret_code=1,\n comment=\"The value should always be <= 10\"\n )\n\n\ndef external_validation_function(df, output_path, *args, **kwargs):\n # Update the dataset inplace here by setting values to the 'is_valid' column.\n # The 'ret_code' and 'comment' values are optional, they will be added to the report\n # in order to help the user to understand why the dataset did not pass the validation.\n\n # We can use the value from kwargs[\"param_value\"] here.\n if int(kwargs[\"param_value\"]) <= 10:\n df[\"is_valid\"] = True\n else:\n df[\"is_valid\"] = False\n df[\"ret_code\"] = 1\n df[\"comment\"] = \"The value should always be <= 10\"\n\n\nclass ValidationTask2(dvf.task.SetValidationTask):\n \"\"\"In some cases you might want to keep the docstring to describe what a developer\n needs to know, not the end-user. 
In this case, you can use the ``__specifications__``\n attribute to store the specifications.\"\"\"\n\n a_parameter = luigi.Parameter()\n\n __specifications__ = \"\"\"Use the __specifications__ to describe the specifications of the\n ValidationTask2.\"\"\"\n\n def inputs(self):\n return {ValidationTask1: {\"col_name\": \"new_col_name_in_current_task\"}}\n\n def kwargs(self):\n return {\"param_value\": self.a_parameter}\n\n validation_function = staticmethod(external_validation_function)\n\n\nclass ValidationWorkflow(dvf.task.ValidationWorkflow):\n \"\"\"Use the global workflow specifications to give general context to the end-user.\"\"\"\n\n def inputs(self):\n return {\n ValidationTask1: {},\n ValidationTask2: {},\n }\n```\n\nWhere the `ValidationWorkflow` class only defines the sub-tasks that should be called for the\nvalidation. The sub-tasks can be either a `dvf.task.ElementValidationTask` or a\n`dvf.task.SetValidationTask`. In both cases, you can define relations between these sub-tasks\nsince one could need the result of another one to run properly. This is defined in two steps:\n\n1. in the required task, a `output_columns` attribute should be defined so that the next tasks\n can know what data is available, as shown in the previous example for the `ValidationTask1`.\n2. in the task that requires another task, a `inputs` method should be defined, as shown in the\n previous example for the `ValidationTask2`.\n\nThe sub-classes of `dvf.task.ElementValidationTask` should return a\n`dvf.result.ValidationResult` object. 
The sub-classes of `dvf.task.SetValidationTask` should\nreturn a `Pandas.DataFrame` object with at least the following columns\n`[\"is_valid\", \"ret_code\", \"comment\", \"exception\"]` and with the same index as the input dataset.\n\n## Generate the specifications of a workflow\n\nThe specifications that the data should follow can be generated with the following luigi command:\n\n```bash\nluigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --result-path out --ValidationTask2-a-parameter 15 --specifications-only\n```\n\n## Running a workflow\n\nThe workflow can be run with the following luigi command (note that the module `test_validation`\nmust be available in your `sys.path`):\n\n\n```bash\nluigi --module test_validation ValidationWorkflow --log-level INFO --local-scheduler --dataset-df dataset.csv --result-path out --ValidationTask2-a-parameter 15\n```\n\nThis workflow will generate the following files:\n\n* `out/report_ValidationWorkflow.pdf`: the PDF validation report.\n* `out/ValidationTask1/report.csv`: The CSV containing the validity values of the task\n `ValidationTask1`.\n* `out/ValidationTask2/report.csv`: The CSV containing the validity values of the task\n `ValidationTask2`.\n* `out/ValidationWorkflow/report.csv`: The CSV containing the validity values of the complete\n workflow.\n\n.. note::\n\n As any `luigi <https://luigi.readthedocs.io/en/stable>`_ workflow, the values can be stored\n into a `luigi.cfg` file instead of being passed to the CLI.\n\n## Advanced features\n\n### Require a regular Luigi task\n\nIn some cases, one want to execute a regular Luigi task in a validation workflow. In this case, it\nis possible to use the `extra_requires()` method to pass these extra requirements. 
In the\nvalidation task it is then possible to get the targets of these extra requirements using the\n`extra_input()` method.\n\n```python\nclass TestTaskA(luigi.Task):\n\n def run(self):\n # Do something and write the 'target.file'\n\n def output(self):\n return target.OutputLocalTarget(\"target.file\")\n\nclass TestTaskB(task.SetValidationTask):\n\n output_columns = {\"extra_target_path\": None}\n\n def kwargs(self):\n return {\"extra_task_target_path\": self.extra_input().path}\n\n def extra_requires(self):\n return TestTaskA()\n\n @staticmethod\n def validation_function(df, output_path, *args, **kwargs):\n df[\"is_valid\"] = True\n df[\"extra_target_path\"] = kwargs[\"extra_task_target_path\"]\n```\n\n## Funding & Acknowledgment\n\nThe development of this software was supported by funding to the Blue Brain Project, a research\ncenter of the \u00c9cole polytechnique f\u00e9d\u00e9rale de Lausanne (EPFL), from the Swiss government\u2019s ETH\nBoard of the Swiss Federal Institutes of Technology.\n\nFor license and authors, see `LICENSE.txt` and `AUTHORS.md` respectively.\n\nCopyright \u00a9 2022-2023 Blue Brain Project/EPFL\n",
"bugtrack_url": null,
"license": "Apache License 2.0",
"summary": "Simple framework to create data validation workflows.",
"version": "0.8.0",
"project_urls": {
"Homepage": "https://data-validation-framework.readthedocs.io",
"Source": "https://github.com/BlueBrain/data-validation-framework",
"Tracker": "https://github.com/BlueBrain/data-validation-framework/issues"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "29ece0bb5d7eb7673ff2108eca4deda0b8d0b96537dc83875b84495ed173b9db",
"md5": "0ac8bd88926de8077c814f2617d31cde",
"sha256": "895d315c819627406b69c53dcc956ff2d4db523fdbe1ab6170db9cf732c85bb8"
},
"downloads": -1,
"filename": "data_validation_framework-0.8.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "0ac8bd88926de8077c814f2617d31cde",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 26079,
"upload_time": "2024-04-26T13:32:30",
"upload_time_iso_8601": "2024-04-26T13:32:30.770233Z",
"url": "https://files.pythonhosted.org/packages/29/ec/e0bb5d7eb7673ff2108eca4deda0b8d0b96537dc83875b84495ed173b9db/data_validation_framework-0.8.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "235c2713d071758a703b6739a7e508ff9ea7160c6bbf33851a3d4266ab60ed34",
"md5": "44c6aada932fb75c3b2ab55b0c49956f",
"sha256": "9c31643e375873dd00218d8a178b07805638284b377547b69022c6ecddef3140"
},
"downloads": -1,
"filename": "data_validation_framework-0.8.0.tar.gz",
"has_sig": false,
"md5_digest": "44c6aada932fb75c3b2ab55b0c49956f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 196688,
"upload_time": "2024-04-26T13:32:33",
"upload_time_iso_8601": "2024-04-26T13:32:33.003850Z",
"url": "https://files.pythonhosted.org/packages/23/5c/2713d071758a703b6739a7e508ff9ea7160c6bbf33851a3d4266ab60ed34/data_validation_framework-0.8.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-26 13:32:33",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "BlueBrain",
"github_project": "data-validation-framework",
"travis_ci": false,
"coveralls": true,
"github_actions": true,
"tox": true,
"lcname": "data-validation-framework"
}