
- Name: pandera
- Version: 0.22.1
- Summary: A light-weight and flexible data validation and testing tool for statistical data objects.
- Upload time: 2024-12-26 21:21:31
- Author: Niels Bantilan
- Keywords: pandas, validation, data-structures
<div align="center"><a href=""><img src="docs/source/_static/pandera-banner.png" width="400"></a></div>

<h1 align="center">
  The Open-source Framework for Precision Data Testing
</h1>

<p align="center">
  📊 🔎 ✅
</p>

<p align="center">
  <i>Data validation for scientists, engineers, and analysts seeking correctness.</i>
</p>



`pandera` is an open source project that provides a flexible and expressive API
for performing data validation on dataframe-like objects to make data
processing pipelines more readable and robust.

Dataframes contain information that `pandera` explicitly validates at runtime.
This is useful in production-critical or reproducible research settings. With
`pandera`, you can:

1. Define a schema once and use it to validate different dataframe types,
   including pandas, polars, dask, modin, and pyspark.
1. Check the types and
   properties of columns in a `DataFrame` or values in a `Series`.
1. Perform more complex statistical validation like hypothesis testing.
1. Parse data to standardize
   the preprocessing steps needed to produce valid data.
1. Seamlessly integrate with existing data analysis/processing pipelines
   via function decorators.
1. Define dataframe models with the class-based API
   with pydantic-style syntax and validate dataframes using the typing syntax.
1. Synthesize data
   from schema objects for property-based testing with pandas data structures.
1. Lazily validate
   dataframes so that all validation checks are executed before raising an error.
1. Integrate with
   a rich ecosystem of Python tools like pydantic, FastAPI, and mypy.

## Documentation

The official documentation is hosted here:

## Install

Using pip:

```
pip install pandera
```

Using conda:

```
conda install -c conda-forge pandera
```

### Extras

Installing additional functionality:



<details>

<summary><i>pip</i></summary>

```bash
pip install 'pandera[hypotheses]' # hypothesis checks
pip install 'pandera[io]'         # yaml/script schema io utilities
pip install 'pandera[strategies]' # data synthesis strategies
pip install 'pandera[mypy]'       # enable static type-linting of pandas
pip install 'pandera[fastapi]'    # fastapi integration
pip install 'pandera[dask]'       # validate dask dataframes
pip install 'pandera[pyspark]'    # validate pyspark dataframes
pip install 'pandera[modin]'      # validate modin dataframes
pip install 'pandera[modin-ray]'  # validate modin dataframes with ray
pip install 'pandera[modin-dask]' # validate modin dataframes with dask
pip install 'pandera[geopandas]'  # validate geopandas geodataframes
pip install 'pandera[polars]'     # validate polars dataframes
```

</details>

<details>

<summary><i>conda</i></summary>

```bash
conda install -c conda-forge pandera-hypotheses  # hypothesis checks
conda install -c conda-forge pandera-io          # yaml/script schema io utilities
conda install -c conda-forge pandera-strategies  # data synthesis strategies
conda install -c conda-forge pandera-mypy        # enable static type-linting of pandas
conda install -c conda-forge pandera-fastapi     # fastapi integration
conda install -c conda-forge pandera-dask        # validate dask dataframes
conda install -c conda-forge pandera-pyspark     # validate pyspark dataframes
conda install -c conda-forge pandera-modin       # validate modin dataframes
conda install -c conda-forge pandera-modin-ray   # validate modin dataframes with ray
conda install -c conda-forge pandera-modin-dask  # validate modin dataframes with dask
conda install -c conda-forge pandera-geopandas   # validate geopandas geodataframes
conda install -c conda-forge pandera-polars      # validate polars dataframes
```

</details>


## Quick Start

```python
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

# calling the schema validates the dataframe and returns it if all checks pass
validated_df = schema(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1
```

## DataFrame Model

`pandera` also provides an alternative API for expressing schemas inspired
by dataclasses and pydantic. The equivalent `DataFrameModel`
for the above `DataFrameSchema` would be:

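```python
import pandera as pa
from pandera.typing import Series

class Schema(pa.DataFrameModel):

    column1: int = pa.Field(le=10)
    column2: float = pa.Field(lt=-1.2)
    column3: str = pa.Field(str_startswith="value_")

    # register the method below as a custom check on column3
    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2

# validate the dataframe `df` defined in the Quick Start
Schema.validate(df)
```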


## Development Installation

```
git clone
cd pandera
export PYTHON_VERSION=...  # specify desired python version
pip install -r dev/requirements-${PYTHON_VERSION}.txt
pip install -e .
```

## Tests

```
pip install pytest
pytest tests
```

## Contributing to pandera

All contributions, bug reports, bug fixes, documentation improvements,
enhancements and ideas are welcome.

A detailed overview of how to contribute can be found in the
contributing guide on GitHub.

## Issues

Use the project's issue tracker on GitHub to submit feature
requests or bug reports.

## Need Help?

There are many ways of getting help with your questions. You can ask a question
on the GitHub Discussions page or reach out to the maintainers and pandera
community on Discord.
## Why `pandera`?

- Dataframe-centric data types, column nullability, and uniqueness
  are first-class concepts.
- Define dataframe models with the class-based API with
  pydantic-style syntax and validate dataframes using the typing syntax.
- `check_input` and `check_output` decorators
  enable seamless integration with existing code.
- `Check`s provide flexibility and performance by giving access to the `pandas`
  API by design, and offer built-in checks for common data tests.
- The `Hypothesis` class provides a tidy-first interface for statistical hypothesis
  testing.
- `Check`s and `Hypothesis` objects support both tidy and wide data validation.
- Use schemas as generative contracts to synthesize data for unit testing.
- Schema inference allows you to bootstrap schemas from data.

## How to Cite

If you use `pandera` in the context of academic or industry research, please
consider citing the **paper** and/or **software package**.

### Paper

```
@InProceedings{ niels_bantilan-proc-scipy-2020,
  author    = { {N}iels {B}antilan },
  title     = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
  booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
  pages     = { 116 - 124 },
  year      = { 2020 },
  editor    = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
  doi       = { 10.25080/Majora-342d178e-010 }
}
```

### Software Package


## License and Credits

`pandera` is licensed under the [MIT license](license.txt) and is written and
maintained by Niels Bantilan.

