# DataRules
## Goal and motivation
This project lets you define rules to validate and correct datasets.
Whenever possible, rules are applied in a vectorized way, which keeps the library fast.
Reasons for making this:
- Implement an alternative to https://github.com/data-cleaning/ based on Python and pandas.
- Implement both validation and correction. Most existing packages provide validation only.
- Support a rule-based way of data processing. The rules can be maintained in a separate file (Python or YAML) if required.
- Apply vectorization to make processing fast.
## Usage
This package provides two operations on data:
- checks (whether data is correct), also known as validations
- corrections (how to fix incorrect data)
### Checks
In checks.py:
```python
from datarules import check


@check(tags=["P1"])
def check_almost_square(width, height):
    return (width - height).abs() <= 4


@check(tags=["P3", "completeness"])
def check_not_too_deep(depth):
    return depth <= 2
```
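Because each check receives whole columns rather than single values, its body is an ordinary pandas expression. The snippet below only illustrates what `check_almost_square` computes when handed pandas Series; it is a plain-pandas sketch, not the datarules API.

```python
import pandas as pd

# Column values matching the example DataFrame used below.
width = pd.Series([3, 3, 3, 3, 3])
height = pd.Series([7, 5, 8, 3, -2])

# The check body runs on whole Series at once (vectorized),
# producing one boolean per row instead of a single value.
result = (width - height).abs() <= 4
print(result.tolist())  # [True, True, False, True, False]
```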
In your main code:
```python
import pandas as pd
from datarules import CheckList

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

checks = CheckList.from_file('checks.py')
report = checks.run(df)
print(report)
```
Output:
```
                  name                            condition  items  passes  fails  NAs error  warnings
0  check_almost_square  check_almost_square(width, height)      5       3      2    0  None         0
1   check_not_too_deep           check_not_too_deep(depth)      5       1      4    0  None         0
```
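The pass/fail counts can be reproduced by hand with plain pandas. This is only a sanity check on the first report row, not how the library builds the report:

```python
# First report row recomputed by hand: 3 rows pass, 2 rows fail.
mask = (df["width"] - df["height"]).abs() <= 4
print(int(mask.sum()), int((~mask).sum()))  # 3 2
```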
### Corrections
In corrections.py:
```python
from datarules import correction
from checks import check_almost_square


@correction(condition=check_almost_square.fails)
def make_square(width, height):
    return {"height": height + (width - height) / 2}
```
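For a failing row, this correction halves the gap between width and height. A quick worked example with plain numbers (purely illustrative):

```python
# One failing row from the example data: width=3, height=8, so |3 - 8| = 5 > 4.
width, height = 3, 8
new_height = height + (width - height) / 2   # 8 + (3 - 8) / 2 = 5.5
assert abs(width - new_height) <= 4          # the corrected row now passes check_almost_square
```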
In your main code:
```python
from datarules import CorrectionList

corrections = CorrectionList.from_file('corrections.py')
report = corrections.run(df)
print(report)
```
Output:
```
          name                                  condition                      action  applied error  warnings
0  make_square  check_almost_square.fails(width, height)  make_square(width, height)        2  None         0
```
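The two applied rows are exactly the rows that failed check_almost_square. Conceptually, the correction has the same effect as this plain-pandas sketch applied to the original df (an illustration of the outcome, not the library's internals, and assuming the returned height values are written back into df):

```python
# Apply the same formula to the failing rows only.
failing = (df["width"] - df["height"]).abs() > 4
corrected = df["height"] + (df["width"] - df["height"]) / 2
df["height"] = df["height"].where(~failing, corrected)
# Rows 2 and 4 change: height 8 -> 5.5 and -2 -> 0.5; both now satisfy check_almost_square.
```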
## Similar work (Python)
These work on pandas, but only do validation:
- [Pandera](https://pandera.readthedocs.io/en/stable/index.html) - Like us, their checks are also vectorized.
- [Pandantic](https://github.com/wesselhuising/pandantic) - Combination of validation and parsing based on [pydantic](https://docs.pydantic.dev/latest/).
The following also offer validation only, but none of them seem to be vectorized or to support pandas directly.
- [Great Expectations](https://github.com/great-expectations/great_expectations) - An overengineered library for validation that has confusing documentation.
- [contessa](https://github.com/kiwicom/contessa) - Meant to be used against databases.
- [validator](https://github.com/CSenshi/Validator)
- [python-valid8](https://github.com/smarie/python-valid8)
- [pyruler](https://github.com/danteay/pyruler) - Dead project that is rule-based.
- [pyrules](https://github.com/miraculixx/pyrules) - Dead project that supports rule-based corrections (but no validation).
## Similar work (R)
This project is inspired by https://github.com/data-cleaning/.
Similar functionality can be found in the following R packages:
- [validate](https://github.com/data-cleaning/validate) - Checking data (implemented)
- [dcmodify](https://github.com/data-cleaning/dcmodify) - Correcting data (implemented)
- [errorlocate](https://github.com/data-cleaning/errorlocate) - Identifying and removing errors (not yet implemented)
- [deductive](https://github.com/data-cleaning/deductive) - Deductive correction based on checks (not yet implemented)
Features found in one of the packages above but not implemented here might eventually make it into this package too.