# data_check
data_check is a simple data validation tool. In its most basic form, it executes SQL queries and compares the results against CSV or Excel files. But it offers more advanced features:
## Features
* [CSV checks](https://andrjas.github.io/data_check/csv_checks/): compare SQL queries against CSV files
* Excel support: Use Excel (xlsx) instead of CSV
* multiple environments (databases) in the configuration file
* [populate tables](https://andrjas.github.io/data_check/loading_data/) from CSV or Excel files
* [execute any SQL files on a database](https://andrjas.github.io/data_check/sql/)
* more complex [pipelines](https://andrjas.github.io/data_check/pipelines/)
* run any script/command (via pipelines)
* simplified checks for [empty datasets](https://andrjas.github.io/data_check/csv_checks/#empty-dataset-checks) and [full table comparison](https://andrjas.github.io/data_check/csv_checks/#full-table-checks)
* [lookups](https://andrjas.github.io/data_check/csv_checks/#lookups) to reuse the same data in multiple queries
* [test data generation](https://andrjas.github.io/data_check/test_data/)
## Database support
data_check is tested with these databases:
- PostgreSQL
- MySQL
- SQLite
- Oracle
- Microsoft SQL Server
Partially supported:
- DuckDB
- Databricks
Other databases supported by [SQLAlchemy](https://docs.sqlalchemy.org/en/20/dialects/) might also work.
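Connections are defined in _data\_check.yml_ as SQLAlchemy connection strings. A minimal sketch, assuming the `connections`/`default_connection` layout described in the documentation (the connection names and URLs below are made up for illustration):

```yaml
# data_check.yml -- connection names and URLs are examples only
default_connection: dev
connections:
    dev: sqlite+pysqlite:///example.db
    prod: postgresql://user:password@db-host/my_database
```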
## Quickstart
You need Python 3.9 or above (currently up to 3.12) to run data_check. The easiest way to install data_check is via [pipx](https://github.com/pipxproject/pipx):
`pipx install data-check`
The data_check Git repository also contains a sample data_check project. Clone the repository, change into the example folder and run data_check:
```
git clone git@github.com:andrjas/data_check.git
cd data_check/example
data_check
```
This will run the tests in the _checks_ folder using the default connection as set in data_check.yml.
See the [documentation](https://andrjas.github.io/data_check) for how to install data_check in different environments with additional database drivers, and for other ways to use data_check.
## Project layout
data_check has a simple layout for projects: a single configuration file and a folder with the test files. You can also organize the test files in subfolders.
```
data_check.yml    # The configuration file
checks/           # Default folder for data tests
    some_test.sql # SQL file with the query to run against the database
    some_test.csv # CSV file with the expected result
    subfolder/    # Tests can be nested in subfolders
```
## CSV checks
This is the default mode when running data_check. data_check expects a SQL file and a CSV file with the same base name. The SQL file is executed against the database and the result is compared with the CSV file. If they match, the test passes; otherwise it fails.
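For example, a check could pair a query with its expected rows; the table and column names here are made up for illustration:

```sql
-- checks/some_test.sql
SELECT id, name
FROM my_schema.customers
WHERE active = 1
```

```csv
id,name
1,Alice
2,Bob
```

data_check pairs the files by base name (_some\_test.sql_ with _some\_test.csv_) and compares the query result against the CSV rows.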
## Pipelines
If data_check finds a file named _data\_check\_pipeline.yml_ in a folder, it will treat this folder as a pipeline check. Instead of running [CSV checks](#csv-checks) it will execute the steps in the YAML file.
Example project with a pipeline:
```
data_check.yml
checks/
    some_test.sql               # this test will run in parallel to the pipeline test
    some_test.csv
    sample_pipeline/
        data_check_pipeline.yml # configuration for the pipeline
        data/
            my_schema.some_table.csv  # data for a table
        data2/
            some_data.csv       # other data
        some_checks/            # folder with CSV checks
            check1.sql
            check1.csv
            ...
        run_this.sql            # a SQL file that will be executed
        cleanup.sql
    other_pipeline/             # you can have multiple pipelines that will run in parallel
        data_check_pipeline.yml
        ...
```
The file _sample\_pipeline/data\_check\_pipeline.yml_ can look like this:
```yaml
steps:
    # this will truncate the table my_schema.some_table and load it with the data from data/my_schema.some_table.csv
    - load: data
    # this will execute the SQL statement in run_this.sql
    - sql: run_this.sql
    # this will append the data from data2/some_data.csv to my_schema.other_table
    - load:
        file: data2/some_data.csv
        table: my_schema.other_table
        mode: append
    # this will run a python script and pass the connection name
    - cmd: "python3 /path/to/my_pipeline.py --connection {{CONNECTION}}"
    # this will run the CSV checks in the some_checks folder
    - check: some_checks
```
Pipeline checks and simple CSV checks can coexist in a project.
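The `cmd` step can run any script or command. A hypothetical _my\_pipeline.py_ like the one referenced above only needs to accept the connection name that data_check substitutes for `{{CONNECTION}}`; this sketch is an illustration, not part of data_check:

```python
# my_pipeline.py -- hypothetical script called from the cmd step above
import argparse

parser = argparse.ArgumentParser()
# data_check replaces {{CONNECTION}} with the name of the active connection
parser.add_argument("--connection", required=True)
args = parser.parse_args()

print(f"running custom pipeline step against connection: {args.connection}")
```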
## Documentation
See the [documentation](https://andrjas.github.io/data_check) for how to set up data_check, how to create a new project, and more options.
## License
[MIT](LICENSE)