datarules


- Name: datarules
- Version: 0.2.0
- Summary: Rules for validating and correcting datasets
- Upload time: 2024-08-07 21:35:32
- Requires Python: >=3.10
- License: Apache License 2.0
- Keywords: rules, validation, checks, correction, data-editing, data-cleaning, data-cleansing
- Requirements: none recorded

# DataRules

## Goal and motivation

This project lets you define rules to validate and correct datasets.
Whenever possible, rules are evaluated on whole columns at once (vectorized) rather than row by row, which keeps processing fast.


Reasons to make this:
- Implement an alternative to https://github.com/data-cleaning/ based on Python and pandas.
- Implement both validation and correction. Most existing packages provide validation only.
- Support a rule-based way of processing data. The rules can be maintained in a separate file (Python or YAML) if required.
- Apply vectorization to make processing fast.

## Usage

This package provides two operations on data:

- checks (whether the data is correct), also known as validations
- corrections (how to fix incorrect data)

### Checks

In checks.py:

```python
from datarules import check


@check(tags=["P1"])
def check_almost_square(width, height):
    return (width - height).abs() <= 4


@check(tags=["P3", "completeness"])
def check_not_too_deep(depth):
    return depth <= 2
```

In your main code:

```python
import pandas as pd
from datarules import CheckList

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

checks = CheckList.from_file('checks.py')
report = checks.run(df)
print(report)
```

Output:
```
                  name                           condition  items  passes  fails  NAs error  warnings
0  check_almost_square  check_almost_square(width, height)      5       3      2    0  None         0
1   check_not_too_deep           check_not_too_deep(depth)      5       1      4    0  None         0

```
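
To see where the counts in this report come from, the two conditions can be evaluated by hand with plain pandas. This is only a sketch of column-wise evaluation and makes no claim about how datarules works internally; the variable names are made up for illustration. Rows without a `depth` value hold NaN, and `NaN <= 2` evaluates to False, which is consistent with the report listing those rows under fails.

```python
import pandas as pd

df = pd.DataFrame([
    {"width": 3, "height": 7},
    {"width": 3, "height": 5, "depth": 1},
    {"width": 3, "height": 8},
    {"width": 3, "height": 3},
    {"width": 3, "height": -2, "depth": 4},
])

# Each condition is computed on whole columns at once and yields a boolean Series.
almost_square = (df["width"] - df["height"]).abs() <= 4
not_too_deep = df["depth"] <= 2  # a missing depth is NaN, and NaN <= 2 is False

# (passes, fails) per condition
square_counts = (int(almost_square.sum()), int((~almost_square).sum()))
deep_counts = (int(not_too_deep.sum()), int((~not_too_deep).sum()))

print(square_counts)  # (3, 2)
print(deep_counts)    # (1, 4)
```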

### Corrections

In corrections.py:

```python
from datarules import correction
from checks import check_almost_square


@correction(condition=check_almost_square.fails)
def make_square(width, height):
    return {"height": height + (width - height) / 2}
```

In your main code:

```python
from datarules import CorrectionList

corrections = CorrectionList.from_file('corrections.py')
report = corrections.run(df)
print(report)
```

Output:
```
          name                                 condition                      action  applied error  warnings
0  make_square  check_almost_square.fails(width, height)  make_square(width, height)        2  None         0
```
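
The effect of this correction on the two failing rows can be worked out with plain pandas. The sketch below applies the `make_square` formula only where the check fails; it illustrates the arithmetic and is not the way datarules applies corrections internally. The names `fails`, `new_height` and `corrected_height` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"width": [3, 3, 3, 3, 3], "height": [7, 5, 8, 3, -2]})

# Rows where check_almost_square fails: width and height differ by more than 4.
fails = (df["width"] - df["height"]).abs() > 4

# The make_square formula moves height halfway towards width.
new_height = df["height"] + (df["width"] - df["height"]) / 2

# Keep the original height where the check passes, use the new value where it fails.
corrected_height = df["height"].where(~fails, new_height)

print(int(fails.sum()))           # 2
print(corrected_height.tolist())  # [7.0, 5.0, 5.5, 3.0, 0.5]
```

After the correction, both adjusted rows differ from the width by 2.5, so they would pass `check_almost_square`.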

## Similar work (Python)

These work on pandas, but only do validation:

- [Pandera](https://pandera.readthedocs.io/en/stable/index.html) - Like datarules, its checks are vectorized.
- [Pandantic](https://github.com/wesselhuising/pandantic) - Combination of validation and parsing based on [pydantic](https://docs.pydantic.dev/latest/).

None of the following seem to be vectorized or to support pandas directly, and most offer validation only:

- [Great Expectations](https://github.com/great-expectations/great_expectations) - An overengineered library for validation that has confusing documentation.
- [contessa](https://github.com/kiwicom/contessa) - Meant to be used against databases.
- [validator](https://github.com/CSenshi/Validator)
- [python-valid8](https://github.com/smarie/python-valid8)
- [pyruler](https://github.com/danteay/pyruler) - Dead project that is rule-based.
- [pyrules](https://github.com/miraculixx/pyrules) - Dead project that supports rule-based corrections (but no validation).

## Similar work (R)

This project is inspired by https://github.com/data-cleaning/.
Similar functionality can be found in the following R packages:

- [validate](https://github.com/data-cleaning/validate) - Checking data (implemented)
- [dcmodify](https://github.com/data-cleaning/dcmodify) - Correcting data (implemented)
- [errorlocate](https://github.com/data-cleaning/errorlocate) - Identifying and removing errors (not yet implemented)
- [deductive](https://github.com/data-cleaning/deductive) - Deductive correction based on checks (not yet implemented)

Features found in the packages above but not implemented here might eventually make it into this package too.

            

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "datarules",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "rules, validation, checks, correction, data-editing, data-cleaning, data-cleansing",
    "author": null,
    "author_email": "lverweijen <lauwerund@gmail.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License 2.0",
    "summary": "Rules for validating and correcting datasets",
    "version": "0.2.0",
    "project_urls": {
        "Changes": "https://github.com/lverweijen/datarules/blob/main/changes.md",
        "Homepage": "https://github.com/lverweijen/datarules",
        "Issues": "https://github.com/lverweijen/datarules/issues",
        "Repository": "https://github.com/lverweijen/datarules"
    },
    "split_keywords": [
        "rules",
        " validation",
        " checks",
        " correction",
        " data-editing",
        " data-cleaning",
        " data-cleansing"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "19546bd889ec1e9f940c949e66b97efdf84b77f40483202569bdf768f6b19704",
                "md5": "b0d541fe2585a9eac0ca9a7c506b3698",
                "sha256": "6ba045b1d2300d97eb948999e6cedb660b26724b90dd3db315a5136dd8372ea5"
            },
            "downloads": -1,
            "filename": "datarules-0.2.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "b0d541fe2585a9eac0ca9a7c506b3698",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 18629,
            "upload_time": "2024-08-07T21:35:32",
            "upload_time_iso_8601": "2024-08-07T21:35:32.680091Z",
            "url": "https://files.pythonhosted.org/packages/19/54/6bd889ec1e9f940c949e66b97efdf84b77f40483202569bdf768f6b19704/datarules-0.2.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-07 21:35:32",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "lverweijen",
    "github_project": "datarules",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "datarules"
}
        