# data_check
data_check is a simple data validation tool. In its most basic form, it executes SQL queries and compares the results against CSV or Excel files. But there are more advanced features:
## Features
* [CSV checks](https://andrjas.github.io/data_check/csv_checks/): compare SQL queries against CSV files
* Excel support: Use Excel (xlsx) instead of CSV
* multiple environments (databases) in the configuration file
* [populate tables](https://andrjas.github.io/data_check/loading_data/) from CSV or Excel files
* [execute any SQL files on a database](https://andrjas.github.io/data_check/sql/)
* more complex [pipelines](https://andrjas.github.io/data_check/pipelines/)
* run any script/command (via pipelines)
* simplified checks for [empty datasets](https://andrjas.github.io/data_check/csv_checks/#empty-dataset-checks) and [full table comparison](https://andrjas.github.io/data_check/csv_checks/#full-table-checks)
* [lookups](https://andrjas.github.io/data_check/csv_checks/#lookups) to reuse the same data in multiple queries
* [test data generation](https://andrjas.github.io/data_check/test_data/)
## Database support
data_check is tested with these databases:
- PostgreSQL
- MySQL
- SQLite
- Oracle
- Microsoft SQL Server
Partially supported:
- DuckDB
- Databricks
Other databases supported by [SQLAlchemy](https://docs.sqlalchemy.org/en/20/dialects/) might also work.
## Quickstart
You need Python 3.9, 3.10 or 3.11 to run data_check. The easiest way to install data_check is via [pipx](https://github.com/pipxproject/pipx):
`pipx install data-check`
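Since data_check is published on PyPI, a plain pip install into an environment you manage yourself also works:

```
pip install data-check
```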
The data_check Git repository is also a sample data_check project. Clone the repository, switch to the example folder and run data_check:
```
git clone git@github.com:andrjas/data_check.git
cd data_check/example
data_check
```
This will run the tests in the _checks_ folder using the default connection as set in data_check.yml.
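As an illustrative sketch, a minimal data_check.yml defines named connections as SQLAlchemy URLs and marks one as the default; the connection names and URLs below are placeholders, and the exact configuration keys are covered in the documentation:

```yaml
# minimal sketch of data_check.yml; names and URLs are placeholders
default_connection: test
connections:
    test: sqlite+pysqlite:///example.db
    prod: postgresql://user:password@db-host/warehouse
```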
See the [documentation](https://andrjas.github.io/data_check) for how to install data_check in different environments with additional database drivers, and for other ways to use data_check.
## Project layout
data_check has a simple layout for projects: a single configuration file and a folder with the test files. You can also organize the test files in subfolders.
    data_check.yml        # The configuration file
    checks/               # Default folder for data tests
        some_test.sql     # SQL file with the query to run against the database
        some_test.csv     # CSV file with the expected result
        subfolder/        # Tests can be nested in subfolders
## CSV checks
This is the default mode when running data_check. data_check expects a SQL file and a CSV file. The SQL file is executed against the database and the result is compared with the CSV file. If they match, the test passes; otherwise it fails.
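For illustration, a check pairs a query with its expected result; the table and rows below are made up:

```sql
-- checks/some_test.sql: the query to run against the database
select id, name from my_schema.customers order by id
```

with the expected result in checks/some_test.csv:

```
id,name
1,Alice
2,Bob
```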
## Pipelines
If data_check finds a file named _data\_check\_pipeline.yml_ in a folder, it will treat this folder as a pipeline check. Instead of running [CSV checks](#csv-checks) it will execute the steps in the YAML file.
Example project with a pipeline:
    data_check.yml
    checks/
        some_test.sql                     # this test will run in parallel to the pipeline test
        some_test.csv
        sample_pipeline/
            data_check_pipeline.yml       # configuration for the pipeline
            data/
                my_schema.some_table.csv  # data for a table
            data2/
                some_data.csv             # other data
            some_checks/                  # folder with CSV checks
                check1.sql
                check1.csv
                ...
            run_this.sql                  # a SQL file that will be executed
            cleanup.sql
        other_pipeline/                   # you can have multiple pipelines that will run in parallel
            data_check_pipeline.yml
            ...
The file _sample\_pipeline/data\_check\_pipeline.yml_ can look like this:
```yaml
steps:
    # this will truncate the table my_schema.some_table and load it with the data from data/my_schema.some_table.csv
    - load: data
    # this will execute the SQL statement in run_this.sql
    - sql: run_this.sql
    # this will append the data from data2/some_data.csv to my_schema.other_table
    - load:
        file: data2/some_data.csv
        table: my_schema.other_table
        mode: append
    # this will run a python script and pass the connection name
    - cmd: "python3 /path/to/my_pipeline.py --connection {{CONNECTION}}"
    # this will run the CSV checks in the some_checks folder
    - check: some_checks
```
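The cmd step runs any script or command. As a hypothetical example, the my_pipeline.py invoked above could read the connection name that data_check substitutes for {{CONNECTION}}:

```python
# my_pipeline.py - hypothetical script for the cmd step above
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="custom pipeline step")
    # data_check replaces {{CONNECTION}} with the current connection name
    parser.add_argument("--connection", required=True)
    args = parser.parse_args()
    print(f"running custom step against connection: {args.connection}")


if __name__ == "__main__":
    main()
```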
Pipeline checks and simple CSV checks can coexist in a project.
## Documentation
See the [documentation](https://andrjas.github.io/data_check) for how to set up data_check, how to create a new project, and more options.
## License
[MIT](LICENSE)