<h1 align="center">
<b>Polars for Data Science</b>
<br>
</h1>
<p align="center">
<a href="https://polars-ds-extension.readthedocs.io/en/latest/">Documentation</a>
|
<a href="https://github.com/abstractqqq/polars_ds_extension/blob/main/examples/basics.ipynb">User Guide</a>
|
<a href="https://github.com/abstractqqq/polars_ds_extension/blob/main/CONTRIBUTING.md">Want to Contribute?</a>
<br>
<b>pip install polars-ds</b>
</p>
# The Project
The goal of the project is to **reduce dependencies**, **improve code organization**, **simplify data pipelines** and overall **facilitate analysis of various kinds of tabular data** that a data scientist may encounter. It is a package built around your favorite **Polars dataframe**. Here are some of the main areas of data science that are covered by the package:
1. Well-known numerical transforms/quantities. E.g. FFT, conditional entropy, singular values, basic linear regression related quantities, population stability index, weight of evidence, column-wise/row-wise Jaccard similarity, etc.
2. Statistics. Basic tests such as the t-test, F-test, and KS statistics. Miscellaneous functions like weighted correlation and Xi-correlation. In-dataframe random column generation, etc.
3. Metrics. ML metrics for common model performance reporting. E.g. ROC AUC for binary/multiclass classification, logloss, r2, MAPE, etc.
4. KNN-related queries. E.g. filter to the k-nearest neighbors of a point, find the indices of all neighbors within a certain distance, etc.
5. String metrics such as Levenshtein distance, Damerau-Levenshtein distance, other string distances, snowball stemming (English only), string Jaccard similarity, etc.
6. Diagnosis. This module contains the DIA (Data Inspection Assistant) class, which can help you profile your data, visualize data in lower dimensions, detect functional dependencies, and detect other common data quality issues like null rate or high correlation. (Needs plotly, great_tables, and graphviz as optional dependencies.)
7. Sample. Traditional dataset sampling. No time series sampling yet. This module provides functionality such as stratified downsampling, volume neutral random sampling, etc.
8. Polars Native ML Pipeline. Planned but not started yet. The goal is to have a Polars native pipeline that can replace Scikit-learn's pipeline and provide all the benefits of Polars. All the basic transforms in Scikit-learn and categorical-encoders are planned. This can be super powerful together with Polars's expressions. (Basically, once you have expressions, you don't need to write custom transforms like col(A)/col(B), log transform, sqrt transform, linear/polynomial transforms, etc.)
Some other areas currently exist but are de-prioritized:
1. Complex number related queries.
2. Graph related queries. (The various representations of "graphs" in tabular dataframes make it hard to have consistent backend handling of such data.)
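For readers who have not met the string metrics named in the list above, here is a plain-Python sketch of two of them: Levenshtein distance and Jaccard similarity. This is only a reference definition for intuition. It is not the package's implementation (which is written in Rust), the function names `levenshtein`, `jaccard`, and `char_ngrams` are hypothetical, and treating a string as a set of character bigrams is an assumption made here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of `s` (assumed tokenization for this sketch)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; returns 0.0 when both sets are empty
    (a convention chosen for this sketch)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


print(levenshtein("kitten", "sitting"))  # 3
print(jaccard(char_ngrams("night"), char_ngrams("nacht")))
```

The appeal of a dataframe-native version is that the same computation can be applied to whole columns without a Python-level loop over rows.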
# But why? Why not use Sklearn? SciPy? NumPy?
The goal of the package is to **facilitate** data processes and analysis that go beyond standard SQL queries, and to **reduce** the number of dependencies in your project. It incorporates parts of SciPy, NumPy, Scikit-learn, and NLP (NLTK), etc., and treats them as Polars queries so that they can be run in parallel and in group_by contexts, all for almost no extra engineering effort.
Let's see an example. Say we want to generate a model performance report. In our data, we have segments. We are not only interested in the ROC AUC of our model on the entire dataset, but we are also interested in the model's performance on different segments.
```python
import numpy as np
import polars as pl
import polars_ds as pds

size = 100_000
df = pl.DataFrame({
    "a": np.random.random(size=size),
    "b": np.random.random(size=size),
    "x1": range(size),
    "x2": range(size, size + size),
    "y": range(-size, 0),
    # Random binary labels and random scores, so ROC AUC should be near 0.5
    "actual": np.round(np.random.random(size=size)).astype(np.int32),
    "predicted": np.random.random(size=size),
    # Two segments of slightly different sizes
    "segments": ["a"] * (size // 2 + 100) + ["b"] * (size // 2 - 100),
})
print(df.head())
shape: (5, 8)
┌──────────┬──────────┬─────┬────────┬─────────┬────────┬───────────┬──────────┐
│ a        ┆ b        ┆ x1  ┆ x2     ┆ y       ┆ actual ┆ predicted ┆ segments │
│ ---      ┆ ---      ┆ --- ┆ ---    ┆ ---     ┆ ---    ┆ ---       ┆ ---      │
│ f64      ┆ f64      ┆ i64 ┆ i64    ┆ i64     ┆ i32    ┆ f64       ┆ str      │
╞══════════╪══════════╪═════╪════════╪═════════╪════════╪═══════════╪══════════╡
│ 0.19483  ┆ 0.457516 ┆ 0   ┆ 100000 ┆ -100000 ┆ 0      ┆ 0.929007  ┆ a        │
│ 0.396265 ┆ 0.833535 ┆ 1   ┆ 100001 ┆ -99999  ┆ 1      ┆ 0.103915  ┆ a        │
│ 0.800558 ┆ 0.030437 ┆ 2   ┆ 100002 ┆ -99998  ┆ 1      ┆ 0.558918  ┆ a        │
│ 0.608023 ┆ 0.411389 ┆ 3   ┆ 100003 ┆ -99997  ┆ 1      ┆ 0.883684  ┆ a        │
│ 0.847527 ┆ 0.506504 ┆ 4   ┆ 100004 ┆ -99996  ┆ 1      ┆ 0.070269  ┆ a        │
└──────────┴──────────┴─────┴────────┴─────────┴────────┴───────────┴──────────┘
```
Traditionally, using the Pandas + Sklearn stack, we would do:
```python
import pandas as pd
from sklearn.metrics import roc_auc_score

df_pd = df.to_pandas()

segments = []
rocaucs = []

# Loop over each segment in Python and score it separately
for (segment, subdf) in df_pd.groupby("segments"):
    segments.append(segment)
    rocaucs.append(
        roc_auc_score(subdf["actual"], subdf["predicted"])
    )

report = pd.DataFrame({
    "segments": segments,
    "roc_auc": rocaucs
})
print(report)

  segments   roc_auc
0        a  0.497745
1        b  0.498801
```
This is ok, but not great, because (1) we are running for loops in Python, which tends to be slow; (2) we are writing more Python code, which leaves more room for errors in bigger projects; and (3) the code is not very intuitive for beginners. Using Polars + polars_ds, one can do the following:
```python
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘
```
Notice a few things: (1) Computing ROC AUC on different segments is equivalent to an aggregation on segments! It is a concept everyone who knows SQL (aka everybody who works with data) will be familiar with. (2) There is no Python-level computation. The extension is written in pure Rust and all complexities are hidden away from the end user. (3) Because Polars provides parallel execution for free, we can compute ROC AUC and log loss simultaneously on each segment! (In Pandas, one can do something like this in aggregations, but it is much harder to write and far more confusing to reason about.)
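To make point (1) concrete, here is a dependency-free sketch of what "ROC AUC and log loss as a group-by aggregation" means: rows are bucketed by segment, and each bucket is reduced to a pair of numbers. This is not how polars_ds computes these metrics (that happens in Rust). The helper names `roc_auc`, `log_loss`, and `report_by_segment`, the toy rows, and the clipping constant `eps` are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log


def roc_auc(actual, predicted):
    """Rank-based (Mann-Whitney U) ROC AUC for 0/1 labels.
    Tied scores receive the average rank of their block."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    ranks = [0.0] * len(predicted)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and predicted[order[j + 1]] == predicted[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class
    pos_rank_sum = sum(r for r, y in zip(ranks, actual) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def log_loss(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(actual)


def report_by_segment(rows):
    """Group (segment, actual, predicted) rows, then aggregate each group."""
    groups = defaultdict(lambda: ([], []))
    for segment, y, p in rows:
        groups[segment][0].append(y)
        groups[segment][1].append(p)
    return {
        seg: {"roc_auc": roc_auc(ys, ps), "log_loss": log_loss(ys, ps)}
        for seg, (ys, ps) in sorted(groups.items())
    }


# Toy data, made up for this sketch
rows = [
    ("a", 0, 0.1), ("a", 1, 0.9), ("a", 0, 0.4), ("a", 1, 0.35),
    ("b", 1, 0.8), ("b", 0, 0.3), ("b", 1, 0.6), ("b", 0, 0.2),
]
report = report_by_segment(rows)
for seg, metrics in report.items():
    print(seg, metrics)
```

The Polars query above expresses the same grouping and reduction declaratively, which is what lets the engine run the per-segment metrics in parallel.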
The end result is simpler, more intuitive code that is also easier to reason about, and faster execution time. Because of Polars's extension (plugin) system, we are now blessed with both:
**Performance and elegance - something that is quite rare in the Python world.**
## Getting Started
```python
import polars_ds as pds
```
To make full use of the Diagnosis module, install the optional dependencies:
```bash
pip install "polars_ds[plot]"
```
## Examples
See this for Polars extensions: [notebook](./examples/basics.ipynb)
See this for native Polars DataFrame exploratory tools: [notebook](./examples/diagnosis.ipynb)
# Disclaimer
**Currently in Beta. Feel free to submit feature requests in the issues section of the repo. This library will only depend on Python Polars and will try to be as stable as possible for polars>=0.20.6. Exceptions will be made when Polars updates force changes in the plugins.**
This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.
The recommended usage is for datasets of 1k to 2-3 million rows, but actual performance will vary depending on dataset and hardware. Performance will only be a priority for datasets that fit in memory. KNN performance is known to suffer greatly with a large k. Str-knn and graph queries are only suitable for smaller data, of size ~1-5k for common computers.
# Credits
1. Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See [here](https://github.com/tsoding/seroost)
2. Some statistics functions are taken from Statrs (MIT) and internalized. See [here](https://github.com/statrs-dev/statrs/tree/master)
3. Graph functionalities are powered by the petgraph crate. See [here](https://crates.io/crates/petgraph)
4. Linear algebra routines are powered partly by [faer](https://crates.io/crates/faer)
# Other related Projects
1. Take a look at our friendly neighbor [functime](https://github.com/TracecatHQ/functime)
2. String similarity metrics are so fast and easy to use because of [RapidFuzz](https://github.com/maxbachmann/rapidfuzz-rs)
Raw data
{
"_id": null,
"home_page": null,
"name": "polars-ds-dg",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "polars-extension, scientific-computing, data-science",
"author": null,
"author_email": "Tianren Qin <tq9695@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/8e/8c/f39b49b87d7f3c2004166003eef60020ba6f39ddfbe5532f91799f06bdfd/polars_ds_dg-0.4.5.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": null,
"version": "0.4.5",
"project_urls": null,
"split_keywords": [
"polars-extension",
" scientific-computing",
" data-science"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "5a406649944c72782fedbeb47cdd1ae00964057977d8f1bc858ff5a6e368b12d",
"md5": "8e2d837234e55946c9431da02b930a04",
"sha256": "03044c4bdf7b754e415f831da93946a29d774565ff76bd4f873205c58231348c"
},
"downloads": -1,
"filename": "polars_ds_dg-0.4.5-cp38-abi3-manylinux_2_34_x86_64.whl",
"has_sig": false,
"md5_digest": "8e2d837234e55946c9431da02b930a04",
"packagetype": "bdist_wheel",
"python_version": "cp38",
"requires_python": ">=3.8",
"size": 13801416,
"upload_time": "2024-05-13T20:25:35",
"upload_time_iso_8601": "2024-05-13T20:25:35.832601Z",
"url": "https://files.pythonhosted.org/packages/5a/40/6649944c72782fedbeb47cdd1ae00964057977d8f1bc858ff5a6e368b12d/polars_ds_dg-0.4.5-cp38-abi3-manylinux_2_34_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "8e8cf39b49b87d7f3c2004166003eef60020ba6f39ddfbe5532f91799f06bdfd",
"md5": "491982637d08e3c82253e6f631d010a2",
"sha256": "5ac4d05ee0f8c1fe5b93b5ed38a64a80f49d6b729916d40bac0bbbec2c140134"
},
"downloads": -1,
"filename": "polars_ds_dg-0.4.5.tar.gz",
"has_sig": false,
"md5_digest": "491982637d08e3c82253e6f631d010a2",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 2410232,
"upload_time": "2024-05-13T20:25:41",
"upload_time_iso_8601": "2024-05-13T20:25:41.170941Z",
"url": "https://files.pythonhosted.org/packages/8e/8c/f39b49b87d7f3c2004166003eef60020ba6f39ddfbe5532f91799f06bdfd/polars_ds_dg-0.4.5.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-05-13 20:25:41",
"github": false,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"lcname": "polars-ds-dg"
}