dfschema


| Field | Value |
|---|---|
| Name | dfschema |
| Version | 0.0.11 |
| Summary | lightweight pandas.DataFrame schema |
| Upload time | 2023-06-14 18:17:52 |
| Author | Philipp |
| Requires Python | >=3.7.1,<4.0 |
| Requirements | No requirements were recorded. |

# DFS (aka Dataframe_Schema)

**DFS** is a lightweight validator for `pandas.DataFrame`. You can think of it as a `jsonschema` for dataframes.

Key features:
1. **Lightweight**: depends only on `pandas` and `pydantic` (which in turn depends only on `typing_extensions`).
2. **Explicit**: inspired by `JsonSchema`, all schemas are stored as JSON (or YAML) files and can be generated or changed on the fly.
3. **Simple**: easy to use, with no need to change your workflow or dive into implementation details.
4. **Comprehensive**: summarizes all errors in a single summary exception, checks distributions, and works on subsets of the dataframe.
5. **Rapid**: base schemas can be generated from a given dataframe or SQL query (using `pd.read_sql`).
6. **Handy**: supports a command line interface (with the `[cli]` extra).
7. **Extendable**: the core idea is to validate *dataframes* of any type. It currently supports only pandas, but we plan to add abstractions that run the same checks on other dataframe types (CuDF, Dask, SparkDF, etc.).

## QuickStart

### 1. Validate DataFrame

Via the wrapper function:
```python
import pandas as pd
import dfschema as dfs


df = pd.DataFrame({
  "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  "b": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

schema_pass = {
  "shape": {"min_rows": 10}
}

schema_raise = {
  "shape": {"min_rows": 20}
}


dfs.validate(df, schema_pass)   # passes: df has 10 rows, min_rows is 10
dfs.validate(df, schema_raise)  # raises DataFrameSchemaError: fewer than 20 rows
```
Alternatively (v2 optional), you can use the root class, `DfSchema`:
```python
dfs.DfSchema.from_dict(schema_pass).validate(df)   # passes silently
dfs.DfSchema.from_dict(schema_raise).validate(df)  # raises DataFrameSchemaError
```
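The "single summary exception" behaviour listed under key features can be pictured with a small, library-free sketch. Every name here (`SummaryError`, `min_rows`, `has_column`, `validate_rows`) is hypothetical and made up for illustration; this is not dfschema's implementation, and rows are plain dicts rather than a `DataFrame` so that the snippet runs with the standard library alone.

```python
from typing import Callable, List, Optional

# NOTE: all names below are hypothetical; this is not dfschema's code.
Check = Callable[[List[dict]], Optional[str]]


class SummaryError(Exception):
    """Carries every failed check, not just the first one."""

    def __init__(self, errors: List[str]) -> None:
        super().__init__(f"{len(errors)} check(s) failed: " + "; ".join(errors))
        self.errors = errors


def min_rows(n: int) -> Check:
    """Build a check that fails when the table has fewer than n rows."""
    def check(rows: List[dict]) -> Optional[str]:
        if len(rows) >= n:
            return None
        return f"expected at least {n} rows, got {len(rows)}"
    return check


def has_column(name: str) -> Check:
    """Build a check that fails when any row lacks the given column."""
    def check(rows: List[dict]) -> Optional[str]:
        missing = sum(1 for row in rows if name not in row)
        if missing == 0:
            return None
        return f"column {name!r} missing in {missing} row(s)"
    return check


def validate_rows(rows: List[dict], checks: List[Check]) -> None:
    """Run every check, then raise once with the full list of failures."""
    errors = [msg for msg in (check(rows) for check in checks) if msg is not None]
    if errors:
        raise SummaryError(errors)


rows = [{"a": i, "b": i} for i in range(1, 11)]  # same data as the df above

validate_rows(rows, [min_rows(10), has_column("a")])  # nothing fails, nothing raised

try:
    validate_rows(rows, [min_rows(20), has_column("c")])
except SummaryError as exc:
    failures = exc.errors  # both problems are reported together
```

Collecting messages first and raising once means a caller sees the row-count problem and the missing column in the same report, instead of fixing one and rerunning to discover the next.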

### 2. Generate Schema

```python
dfs.DfSchema.from_df(df)
```
### 3. Read and Write Schemas
  
```python
schema = dfs.DfSchema.from_file('schema.json')
schema.to_file("schema.yml")
```
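To try this round trip, a `schema.json` has to exist first. The sketch below writes one with the standard `json` module, assuming (this README does not confirm it) that the file mirrors the dict layout accepted by `DfSchema.from_dict`. The temporary directory is only there to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

# Same dict as schema_pass in the validation example.
schema = {"shape": {"min_rows": 10}}

# Write the schema to disk as JSON (the file layout is an assumption).
path = Path(tempfile.mkdtemp()) / "schema.json"
path.write_text(json.dumps(schema, indent=2))

# Because the schema is plain data, it can be reloaded and changed on the fly.
loaded = json.loads(path.read_text())
loaded["shape"]["min_rows"] = 20
```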

### 4. Using CLI
> Note: requires the `[cli]` extra, because the CLI relies on `Typer` and `click`.

#### Validate via CLI
```shell
dfschema validate --read_kwargs_json '{"delimiter": "|"}' FILEPATH SCHEMA_FILEPATH
```
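The option name suggests that `--read_kwargs_json` takes a JSON object, presumably forwarded as keyword arguments to the file reader (an assumption, not something this README states). One way to avoid quoting mistakes is to build the value with `json.dumps` and let `shlex.quote` handle shell escaping; `data.csv` and `schema.json` below are placeholder paths.

```python
import json
import shlex

# Keyword arguments intended for the file reader (assumed semantics).
read_kwargs = {"delimiter": "|"}
payload = json.dumps(read_kwargs)

# data.csv and schema.json are placeholder paths for illustration.
command = ["dfschema", "validate", "--read_kwargs_json", payload, "data.csv", "schema.json"]

# Quote each part so the line can be pasted into a POSIX shell.
shell_line = " ".join(shlex.quote(part) for part in command)
print(shell_line)
```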
Supported file formats:
- csv
- xlsx
- parquet
- feather

#### Generate via CLI
```shell
dfschema generate --format 'yaml' DATA_PATH > schema.yaml
```

## Installation

WIP

## Alternatives
- [TableSchema](https://pypi.org/project/tableschema/)
- [GreatExpectations](https://greatexpectations.io/). Large and complex package with HTML reports, an Airflow Operator, connectors, etc. Can work on out-of-memory data, SQL databases, parquet, etc.
- [Pandera](https://pandera.readthedocs.io/en/stable/). Awesome package, well suited to type hinting and compatible with `hypothesis`.
  - [great talk](https://www.youtube.com/watch?v=PI5UmKi14cM)
- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv)
- [dbt-expectations](https://github.com/calogica/dbt-expectations)

## Changes
- [[changelog]]

## Roadmap
- [ ] Add tutorial Notebook
- [ ] Support tableschema
- [ ] Support Modin models
- [ ] Support SQLAlchemy ORM models
- [ ] Built-in Airflow Operator?
- [ ] Interactive CLI/jupyter for schema generation
            
