# pydatasummary

- Name: pydatasummary
- Version: 0.0.2
- Home page: https://github.com/NKeleher/pydatasummary#readme
- Upload time: 2024-01-24 19:38:42
- Author: Niall Keleher
- Requires Python: >=3.9,<4.0
- License: MIT
- Keywords: tables, statistics, econometrics

Customizable data and model summaries in Python.

`pydatasummary` creates tables that provide descriptive statistics of
numeric and categorical data.

The goal is to provide a simple -- yet customizable -- way to summarize
data and models in Python.

`pydatasummary` is heavily inspired by [`modelsummary`](https://modelsummary.com/)
in R. The goal is not to replicate all that `modelsummary` does, but to provide
a way of achieving similar results in Python.

To do so, `pydatasummary` builds on the [`polars`](https://docs.pola.rs/)
library to produce tables that can be easily customized and exported to other formats.

## Basic Usage

As an example of `pydatasummary` usage, the `skim` function provides a
summary of a DataFrame (either `polars.DataFrame` or `pandas.DataFrame`).
By default, `pydatasummary.skim()` reports the number of unique values, the
percentage of missing values, the mean, standard deviation, minimum, median, and maximum.
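
To make these seven statistics concrete, here is a minimal sketch that computes them for a single column using only the standard library. It is not how `pydatasummary` is implemented. The helper name `summarize_column`, the made-up sample values, the use of the sample (n - 1) standard deviation, and the choice to exclude missing values from the unique count are all assumptions made for illustration.

```python
import statistics


def summarize_column(values):
    """Return seven summary statistics for one numeric column.

    `values` is a list of numbers in which None marks a missing value.
    This helper is hypothetical and only illustrates the definitions.
    """
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    return {
        # Assumption: missing values are not counted as a unique value.
        "Unique (#)": len(set(present)),
        "Missing (%)": 100 * n_missing / len(values),
        "Mean": statistics.mean(present),
        # Assumption: sample standard deviation (n - 1 denominator).
        "SD": statistics.stdev(present),
        "Min": min(present),
        "Median": statistics.median(present),
        "Max": max(present),
    }


# Made-up sample with one missing value.
sample = [8.3, 8.6, None, 10.5, 10.7, 10.7]
summary = summarize_column(sample)
print({name: round(value, 1) for name, value in summary.items()})
```

Each key matches a column header in the tables printed by `skim()`, so the sketch can serve as a reading guide for that output.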

Where possible, `pydatasummary` will print a table to the console and return a
polars DataFrame with the summary statistics. This allows for easy customization.
For example, the `polars.DataFrame` with statistics from `pydatasummary` can be
modified using the [`Great Tables`](https://posit-dev.github.io/great-tables/reference/) package.

```python
import polars as pl
import pydatasummary as ds

# Load the mtcars data set and drop the row-name column.
df = pl.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv"
).drop("rownames")

# skim() prints the table shown below and returns the statistics.
stats = ds.skim(df)

# Summary Statistics
# Rows: 32, Columns: 11
# ┌──────┬────────────┬─────────────┬───────┬───────┬──────┬────────┬───────┐
# │      ┆ Unique (#) ┆ Missing (%) ┆  Mean ┆    SD ┆  Min ┆ Median ┆   Max │
# ╞══════╪════════════╪═════════════╪═══════╪═══════╪══════╪════════╪═══════╡
# │  mpg ┆         25 ┆         0.0 ┆  20.1 ┆   6.0 ┆ 10.4 ┆   19.2 ┆  33.9 │
# │  cyl ┆          3 ┆         0.0 ┆   6.2 ┆   1.8 ┆  4.0 ┆    6.0 ┆   8.0 │
# │ disp ┆         27 ┆         0.0 ┆ 230.7 ┆ 123.9 ┆ 71.1 ┆  196.3 ┆ 472.0 │
# │   hp ┆         22 ┆         0.0 ┆ 146.7 ┆  68.6 ┆ 52.0 ┆  123.0 ┆ 335.0 │
# │ drat ┆         22 ┆         0.0 ┆   3.6 ┆   0.5 ┆  2.8 ┆    3.7 ┆   4.9 │
# │   wt ┆         29 ┆         0.0 ┆   3.2 ┆   1.0 ┆  1.5 ┆    3.3 ┆   5.4 │
# │ qsec ┆         30 ┆         0.0 ┆  17.8 ┆   1.8 ┆ 14.5 ┆   17.7 ┆  22.9 │
# │   vs ┆          2 ┆         0.0 ┆   0.4 ┆   0.5 ┆  0.0 ┆    0.0 ┆   1.0 │
# │   am ┆          2 ┆         0.0 ┆   0.4 ┆   0.5 ┆  0.0 ┆    0.0 ┆   1.0 │
# │ gear ┆          3 ┆         0.0 ┆   3.7 ┆   0.7 ┆  3.0 ┆    4.0 ┆   5.0 │
# │ carb ┆          6 ┆         0.0 ┆   2.8 ┆   1.6 ┆  1.0 ┆    2.0 ┆   8.0 │
# └──────┴────────────┴─────────────┴───────┴───────┴──────┴────────┴───────┘
```

The same summary can be produced from a pandas DataFrame.

```python
import pandas as pd
import pydatasummary as ds

trees_df = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
).drop(columns=["rownames"])

# skim() prints the table shown below and returns the statistics.
trees_stats = ds.skim(trees_df)

# Summary Statistics
# Rows: 31, Columns: 3
# ┌────────┬────────────┬─────────────┬──────┬──────┬──────┬────────┬──────┐
# │        ┆ Unique (#) ┆ Missing (%) ┆ Mean ┆   SD ┆  Min ┆ Median ┆  Max │
# ╞════════╪════════════╪═════════════╪══════╪══════╪══════╪════════╪══════╡
# │  Girth ┆         27 ┆         0.0 ┆ 13.2 ┆  3.1 ┆  8.3 ┆   12.9 ┆ 20.6 │
# │ Height ┆         21 ┆         0.0 ┆ 76.0 ┆  6.4 ┆ 63.0 ┆   76.0 ┆ 87.0 │
# │ Volume ┆         30 ┆         0.0 ┆ 30.2 ┆ 16.4 ┆ 10.2 ┆   24.2 ┆ 77.0 │
# └────────┴────────────┴─────────────┴──────┴──────┴──────┴────────┴──────┘
```

## Contributing

If you encounter a bug, have usage questions, or want to share ideas to make
the `pydatasummary` package more useful, please feel free to file an
[issue](https://github.com/NKeleher/pydatasummary/issues).

## Code of Conduct

Please note that the **pydatasummary** project is released with a
[contributor code of conduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct/).

By participating in this project you agree to abide by its terms.

## License

**pydatasummary** is licensed under the MIT license.

## Governance

This project is primarily maintained by [Niall Keleher](https://twitter.com/nkeleher).
Contributions from other authors are welcome.

            
