data_check

Name: data_check
Version: 0.19.0
Home page: https://andrjas.github.io/data_check/
Summary: simple data validation
Upload time: 2024-03-18 06:05:49
Author: Andreas Rjasanow
Requires Python: >=3.9,<3.13
License: MIT
Keywords: data, validation, testing, quality

# data_check

data_check is a simple data validation tool. In its most basic form, it executes SQL queries and compares the results against CSV or Excel files. But there are also more advanced features:

## Features

* [CSV checks](https://andrjas.github.io/data_check/csv_checks/): compare SQL queries against CSV files
* Excel support: use Excel (xlsx) instead of CSV
* multiple environments (databases) in the configuration file
* [populate tables](https://andrjas.github.io/data_check/loading_data/) from CSV or Excel files
* [execute any SQL files on a database](https://andrjas.github.io/data_check/sql/)
* more complex [pipelines](https://andrjas.github.io/data_check/pipelines/)
* run any script/command (via pipelines)
* simplified checks for [empty datasets](https://andrjas.github.io/data_check/csv_checks/#empty-dataset-checks) and [full table comparison](https://andrjas.github.io/data_check/csv_checks/#full-table-checks)
* [lookups](https://andrjas.github.io/data_check/csv_checks/#lookups) to reuse the same data in multiple queries
* [test data generation](https://andrjas.github.io/data_check/test_data/)

## Database support

data_check is tested with these databases:

- PostgreSQL
- MySQL
- SQLite
- Oracle
- Microsoft SQL Server

Partially supported:

- DuckDB
- Databricks

Other databases supported by [SQLAlchemy](https://docs.sqlalchemy.org/en/20/dialects/) might also work.

## Quickstart

You need Python 3.9 or above (but below 3.13) to run data_check. The easiest way to install data_check is via [pipx](https://github.com/pipxproject/pipx):

`pipx install data-check`

The data_check Git repository is also a sample data_check project. Clone the repository, switch to the _example_ folder and run data_check:

```sh
git clone git@github.com:andrjas/data_check.git
cd data_check/example
data_check
```

This will run the tests in the _checks_ folder using the default connection as set in data_check.yml.

See the [documentation](https://andrjas.github.io/data_check) for how to install data_check in different environments with additional database drivers, and for other ways to use data_check.

## Project layout

data_check has a simple layout for projects: a single configuration file and a folder with the test files. You can also organize the test files in subfolders.

    data_check.yml    # The configuration file
    checks/           # Default folder for data tests
        some_test.sql # SQL file with the query to run against the database
        some_test.csv # CSV file with the expected result
        subfolder/    # Tests can be nested in subfolders
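
The configuration file holds the database connections as SQLAlchemy URLs. A minimal sketch of what _data_check.yml_ could contain (the connection names and URLs here are illustrative, not prescriptive; see the documentation for the exact options):

```yaml
# data_check.yml -- illustrative example
default_connection: test    # connection used when none is selected explicitly
connections:
    test: sqlite+pysqlite:///test.db
    prod: postgresql://user:password@db_host:5432/prod_db
```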

## CSV checks

This is the default mode when running data_check. data_check expects a SQL file and a CSV file with the same base name. The SQL file is executed against the database and the result is compared with the CSV file. If they match, the test passes; otherwise it fails.
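
For example, a minimal check could pair a query in _checks/some_test.sql_ (the query and data here are illustrative):

```sql
-- checks/some_test.sql: the query whose result is validated
select 1 as id, 'alice' as name
union all
select 2, 'bob'
```

with the expected result in _checks/some_test.csv_:

```
id,name
1,alice
2,bob
```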

## Pipelines

If data_check finds a file named _data\_check\_pipeline.yml_ in a folder, it will treat this folder as a pipeline check. Instead of running [CSV checks](#csv-checks) it will execute the steps in the YAML file.

Example project with a pipeline:

    data_check.yml
    checks/
        some_test.sql                # this test will run in parallel to the pipeline test
        some_test.csv
        sample_pipeline/
            data_check_pipeline.yml  # configuration for the pipeline
            data/
                my_schema.some_table.csv       # data for a table
            data2/
                some_data.csv        # other data
            some_checks/             # folder with CSV checks
                check1.sql
                check1.csv
                ...
            run_this.sql             # a SQL file that will be executed
            cleanup.sql
        other_pipeline/              # you can have multiple pipelines that will run in parallel
            data_check_pipeline.yml
            ...

The file _sample\_pipeline/data\_check\_pipeline.yml_ can look like this:

```yaml
steps:
    # this will truncate the table my_schema.some_table and load it with the data from data/my_schema.some_table.csv
    - load: data
    # this will execute the SQL statement in run_this.sql
    - sql: run_this.sql
    # this will append the data from data2/some_data.csv to my_schema.other_table
    - load:
        file: data2/some_data.csv
        table: my_schema.other_table
        mode: append
    # this will run a python script (sketched below) and pass the connection name
    - cmd: "python3 /path/to/my_pipeline.py --connection {{CONNECTION}}"
    # this will run the CSV checks in the some_checks folder
    - check: some_checks
```
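
The script called by the `cmd` step is entirely user-defined; a hypothetical _my_pipeline.py_ might simply read the connection name that data_check substitutes for `{{CONNECTION}}`:

```python
# my_pipeline.py -- hypothetical script invoked from the pipeline above
import argparse

parser = argparse.ArgumentParser()
# data_check replaces {{CONNECTION}} with the name of the active connection
parser.add_argument("--connection", help="name of the data_check connection")
args = parser.parse_args()

print(f"running a custom pipeline step against connection: {args.connection}")
```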

Pipeline checks and simple CSV checks can coexist in a project.
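
To run only a part of a project, data_check can also be pointed at individual files or folders; for example, to run just the sample pipeline from the layout above:

```sh
data_check checks/sample_pipeline
```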

## Documentation

See the [documentation](https://andrjas.github.io/data_check) for how to set up data_check, how to create a new project, and more options.

## License

[MIT](LICENSE)

            
