| Field | Value |
| --- | --- |
| Name | dfschema |
| Version | 0.0.11 |
| Summary | lightweight pandas.DataFrame schema |
| upload_time | 2023-06-14 18:17:52 |
| author | Philipp |
| requires_python | >=3.7.1,<4.0 |
| requirements | No requirements were recorded. |
# DFS (aka Dataframe_Schema)
**DFS** is a lightweight validator for `pandas.DataFrame`. You can think of it as `jsonschema` for dataframes.
Key features:
1. **Lightweight**: depends only on `pandas` and `pydantic` (which itself depends only on `typing_extensions`).
2. **Explicit**: inspired by `JsonSchema`, all schemas are stored as JSON (or YAML) files and can be generated or changed on the fly.
3. **Simple**: easy to use, with no need to change your workflow or dive into implementation details.
4. **Comprehensive**: summarizes all errors in a single summary exception, checks distributions, and works on subsets of the dataframe.
5. **Rapid**: base schemas can be generated from a given dataframe or SQL query (using `pd.read_sql`).
6. **Handy**: supports a command line interface (with the `[cli]` extra).
7. **Extendable**: the core idea is to validate *dataframes* of any type. Only pandas is supported for now, but we plan to add abstractions that run the same checks on other dataframe types (cuDF, Dask, Spark DataFrames, etc.).
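The "single summary exception" idea from item 4 can be illustrated without the library. The sketch below is not dfschema's implementation: `SummaryError`, `check_min_rows`, `check_columns`, and `validate_all` are hypothetical names, and the "dataframe" is a plain dict of columns so the example runs with the standard library only.

```python
from typing import Callable, Dict, List

Table = Dict[str, list]  # stand-in for a dataframe: column name -> values


class SummaryError(Exception):
    """Hypothetical exception that carries every failed check at once."""

    def __init__(self, errors: List[str]):
        super().__init__(f"{len(errors)} check(s) failed: " + "; ".join(errors))
        self.errors = errors


def check_min_rows(table: Table, min_rows: int) -> None:
    # Row count is taken from the first column (0 for an empty table).
    n_rows = len(next(iter(table.values()), []))
    if n_rows < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {n_rows}")


def check_columns(table: Table, required: List[str]) -> None:
    missing = [name for name in required if name not in table]
    if missing:
        raise ValueError(f"missing columns: {missing}")


def validate_all(table: Table, checks: List[Callable[[Table], None]]) -> None:
    """Run every check, collect failures, and raise once at the end."""
    errors = []
    for check in checks:
        try:
            check(table)
        except ValueError as exc:
            errors.append(str(exc))
    if errors:
        raise SummaryError(errors)


table = {"a": list(range(1, 11)), "b": list(range(1, 11))}
checks = [
    lambda t: check_min_rows(t, 20),
    lambda t: check_columns(t, ["a", "c"]),
]

try:
    validate_all(table, checks)
except SummaryError as exc:
    collected = exc.errors  # both failures are reported together
```

Collecting failures before raising means one validation run shows every problem, rather than stopping at the first one.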
## QuickStart
### 1. Validate DataFrame
Via the wrapper function:
```python
import pandas as pd
import dfschema as dfs
df = pd.DataFrame({
"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
schema_pass = {
"shape": {"min_rows": 10}
}
schema_raise = {
"shape": {"min_rows": 20}
}
dfs.validate(df, schema_pass)   # passes: df has 10 rows
dfs.validate(df, schema_raise)  # raises DataFrameSchemaError: df has fewer than 20 rows
```
Alternatively (v2 optional), you can use the root class, `DfSchema`:
```python
dfs.DfSchema.from_dict(schema_pass).validate(df)   # passes
dfs.DfSchema.from_dict(schema_raise).validate(df)  # raises DataFrameSchemaError
```
### 2. Generate Schema
```python
dfs.DfSchema.from_df(df)
```
### 3. Read and Write Schemas
```python
schema = dfs.DfSchema.from_file('schema.json')
schema.to_file("schema.yml")
```
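Because schemas are plain JSON (or YAML) documents, a file such as `schema.json` can also be produced with the standard library. This is a minimal sketch, assuming the dict format from section 1 is also what `DfSchema.from_file` expects on disk (not verified here). The temporary directory and file name are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Same schema dict as in the validation example above.
schema_pass = {"shape": {"min_rows": 10}}

with tempfile.TemporaryDirectory() as tmp_dir:
    schema_path = Path(tmp_dir) / "schema.json"

    # Write the schema as human-readable JSON.
    schema_path.write_text(json.dumps(schema_pass, indent=2))

    # Read it back to confirm the round trip preserves the content.
    loaded = json.loads(schema_path.read_text())
```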
### 4. Using CLI
> Note: requires the `[cli]` extra, as the CLI relies on `Typer` and `click`.
#### Validate via CLI
```shell
dfschema validate --read_kwargs_json '{"delimiter": "|"}' FILEPATH SCHEMA_FILEPATH
```
Supported file formats:
- csv
- xlsx
- parquet
- feather
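Hand-writing the `--read_kwargs_json` value is easy to get wrong, because it has to survive both shell quoting and JSON parsing. The sketch below builds the argument with `json.dumps`, assuming the option value is parsed as JSON (as its name suggests). `data.csv` and `schema.json` are placeholder paths, and the command is only assembled, not executed.

```python
import json
import shlex

# Keyword arguments intended for the underlying file reader.
read_kwargs = {"delimiter": "|"}

command = [
    "dfschema", "validate",
    "--read_kwargs_json", json.dumps(read_kwargs),
    "data.csv",     # placeholder for FILEPATH
    "schema.json",  # placeholder for SCHEMA_FILEPATH
]

# A shell-safe string form, useful for logging or copy-pasting.
printable = shlex.join(command)
```

Passing `command` as a list to `subprocess.run` sidesteps shell quoting entirely.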
#### Generate via CLI
```shell
dfs generate --format 'yaml' DATA_PATH > schema.yaml
```
## Installation
WIP
## Alternatives
- [TableSchema](https://pypi.org/project/tableschema/)
- [GreatExpectations](https://greatexpectations.io/). A large, complex package with HTML reports, an Airflow operator, connectors, and more. Can work on out-of-memory data, SQL databases, Parquet, etc.
- [Pandera](https://pandera.readthedocs.io/en/stable/) - an awesome package, well suited to type hinting and compatible with `hypothesis`
  - [great talk](https://www.youtube.com/watch?v=PI5UmKi14cM)
- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)
- [dbt-expectations](https://github.com/calogica/dbt-expectations)
## Changes
- [[changelog]]
## Roadmap
- [ ] Add tutorial Notebook
- [ ] Support tableschema
- [ ] Support Modin models
- [ ] Support SQLAlchemy ORM models
- [ ] Built-in Airflow Operator?
- [ ] Interactive CLI/jupyter for schema generation
Raw data
{
"_id": null,
"home_page": "",
"name": "dfschema",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7.1,<4.0",
"maintainer_email": "",
"keywords": "",
"author": "Philipp",
"author_email": "philippk@zillowgroup.com",
"download_url": "https://files.pythonhosted.org/packages/ef/0a/430df3674e777cf289342b558340ed971ed381b48374481d54247c2c833b/dfschema-0.0.11.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "lightweight pandas.DataFrame schema",
"version": "0.0.11",
"project_urls": null,
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a6cf0d985bee3872b7f059d7aac45d2d783e82324bbab58e7fc1ad3a503fccc5",
"md5": "b0dd8081c36a424081e2ec27121e1830",
"sha256": "87f8b291d86298942c6358e5f8df35f682ecb5680599c331ef2b4dc9fc8ca0bc"
},
"downloads": -1,
"filename": "dfschema-0.0.11-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b0dd8081c36a424081e2ec27121e1830",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7.1,<4.0",
"size": 21930,
"upload_time": "2023-06-14T18:17:51",
"upload_time_iso_8601": "2023-06-14T18:17:51.530650Z",
"url": "https://files.pythonhosted.org/packages/a6/cf/0d985bee3872b7f059d7aac45d2d783e82324bbab58e7fc1ad3a503fccc5/dfschema-0.0.11-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "ef0a430df3674e777cf289342b558340ed971ed381b48374481d54247c2c833b",
"md5": "6dadfbd9d4d5c33c9cfc2b993eb02683",
"sha256": "748b0bed3f47e43cb52361454aac2b83c356ff439c2bcf9c40486d747f9a318b"
},
"downloads": -1,
"filename": "dfschema-0.0.11.tar.gz",
"has_sig": false,
"md5_digest": "6dadfbd9d4d5c33c9cfc2b993eb02683",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7.1,<4.0",
"size": 18151,
"upload_time": "2023-06-14T18:17:52",
"upload_time_iso_8601": "2023-06-14T18:17:52.652123Z",
"url": "https://files.pythonhosted.org/packages/ef/0a/430df3674e777cf289342b558340ed971ed381b48374481d54247c2c833b/dfschema-0.0.11.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-06-14 18:17:52",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "dfschema"
}