![pypi](https://img.shields.io/pypi/v/skippa)
![python versions](https://img.shields.io/pypi/pyversions/skippa)
![downloads](https://img.shields.io/pypi/dm/skippa)
![Build status](https://img.shields.io/azure-devops/build/data-science-lab/Intern/263)
<br><br>
<img src="skippa-logo-transparent.png" alt="logo" width="200"/>
# Skippa
SciKIt-learn Pre-processing Pipeline in PAndas
> __*Read more in the [introduction blog on towardsdatascience](https://towardsdatascience.com/introducing-skippa-bab260acf6a7)*__
Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.
Skippa helps you build a pre-processing and modeling pipeline that is based on scikit-learn transformers but keeps the data in pandas DataFrame format throughout all pre-processing steps. This makes it much easier to define a series of transformation steps that refer to columns in your intermediate dataframe.
The idea is basically the same as `scikit-pandas`, but with a different (and hopefully better) way to achieve it.
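To illustrate the problem being addressed, here is a minimal sketch using plain scikit-learn only (no Skippa). The small dataframe is made up for illustration. By default a scikit-learn transformer returns a NumPy array, so column names are lost and have to be restored by hand before a later step can refer to them.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data, not taken from the Skippa docs
df = pd.DataFrame({'y': [1.0, 16.0, 1000.0], 'z': [0.4, 4.5, 8.7]})

# With default settings the result is a NumPy array without column names
scaled = StandardScaler().fit_transform(df)
print(type(scaled).__name__)

# Column names and index must be re-attached manually
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```

Doing this re-wrapping after every step is the kind of bookkeeping a dataframe-preserving pipeline is meant to remove.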
- [pypi](https://pypi.org/project/skippa/)
- [Documentation](https://skippa.readthedocs.io/)
## Installation
```
pip install skippa
```
Optional, if you want to use the [gradio app functionality](./examples/04-gradio-app.py):
```
pip install skippa[gradio]
```
## Basic usage
Import the `Skippa` class and the `columns` helper function:
```
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from skippa import Skippa, columns
```
Get some data
```
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])
```
Define your pipeline:
```
pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)
```
and use it for fitting / predicting like this:
```
pipe.fit(X=df, y=y)
predictions = pipe.predict_proba(df)
```
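As a worked example of what the numeric steps do, the sketch below reproduces median imputation and standard scaling for the `z` column by hand, using pandas only. It assumes that `strategy='median'` and `type='standard'` behave like scikit-learn's `SimpleImputer` and `StandardScaler` (which divides by the population standard deviation); that mapping is an assumption, not something confirmed here.

```python
import pandas as pd

# The 'z' column from the example data, with one missing value
z = pd.Series([0.4, None, 8.7], name='z')

# Median imputation: the median of the observed values (0.4 and 8.7) is 4.55
z_imputed = z.fillna(z.median())

# Standard scaling: subtract the mean, divide by the population std (ddof=0)
z_scaled = (z_imputed - z_imputed.mean()) / z_imputed.std(ddof=0)

print(z_imputed.tolist())
print(z_scaled.round(4).tolist())
```

Because the imputed value equals the mean of the three values, it is scaled to 0 and the other two values end up symmetric around it.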
If you want details on your model, use:
```
model = pipe.get_model()
print(model.coef_)
print(model.intercept_)
```
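For comparison, the pipeline in this section can be approximated in plain scikit-learn with a `ColumnTransformer`. This is a rough sketch and not Skippa's internal implementation: the step names (`'num'`, `'cat'`, `'prep'`) and the `handle_unknown='ignore'` setting are choices made for illustration, and the column lists have to be spelled out up front instead of being selected by dtype.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7],
})
target = np.array([0, 0, 1])

# Numeric columns: median imputation followed by standard scaling
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
# Categorical columns: most-frequent imputation followed by one-hot encoding
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([('prep', preprocess), ('model', LogisticRegression())])
sk_pipe.fit(df, target)
proba = sk_pipe.predict_proba(df)
print(proba.shape)
```

The output of `preprocess` is an array rather than a dataframe, so any follow-up step would need column positions instead of column names.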
## (de)serialization
You can also save and load your model pipelines, for example for deployment.
N.B. [`dill`](https://pypi.org/project/dill/) is used for serialization and deserialization because joblib and pickle don't provide enough support.
```
pipe.save('./models/my_skippa_model_pipeline.dill')
...
my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
```
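One well-known limitation of the standard `pickle` module, shown in the sketch below, is that it cannot serialize lambda functions, which are common in `.assign()` or `.apply()` style steps. Whether this is the exact limitation that motivated the choice of `dill` is an assumption; the lambda here is a made-up example.

```python
import pickle

# Hypothetical feature function of the kind a pipeline step might hold
double = lambda value: value * 2

try:
    pickle.dumps(double)
    pickled_ok = True
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    # pickle stores functions by name, and a lambda has no importable name
    pickled_ok = False
    print(f"pickle failed: {exc}")

print(pickled_ok)
```

`dill` serializes such functions by value instead of by reference, so a pipeline that contains them can still be saved.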
See the [./examples](./examples) directory for more examples:
- [01-standard-pipeline.py](./examples/01-standard-pipeline.py)
- [02-preprocessing-only.py](./examples/02-preprocessing-only.py)
- [03a-gridsearch.py](./examples/03a-gridsearch.py)
- [03b-hyperopt.py](./examples/03b-hyperopt.py)
- [04-gradio-app.py](./examples/04-gradio-app.py)
- [05-PCA.py](./examples/05-PCA.py)
## To Do
- [x] Support pandas assign for creating new columns based on existing columns
- [x] Support cast / astype transformer
- [x] Support for .apply transformer: wrapper around `pandas.DataFrame.apply`
- [x] Check how GridSearch (or other param search) works with Skippa
- [x] Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
- [x] Support PCA transformer
- [ ] Facilitate random seed in Skippa object that is dispatched to all downstream operations
- [ ] fit-transform does lazy evaluation, so casting to category and then selecting category columns doesn't work; each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
- [ ] Investigate if Skippa can directly extend sklearn's Pipeline, using the `__getitem__` trick
- [ ] Use sklearn's new dataframe output setting
- [ ] Validation of pipeline steps
- [ ] Input validation in transformers
- [ ] Transformer for replacing values (pandas .replace)
- [ ] Support arbitrary transformer (if column-preserving)
- [ ] Eliminate the need to call columns explicitly
## Credits
- Skippa is powered by [Data Science Lab Amsterdam](https://www.datasciencelab.nl)
- This project structure is based on the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.
# History
## 0.1.16 (2023-08-17)
- Bugfix: missing `_replace_none` attribute for SimpleImputer with `strategy='constant'`
## 0.1.15 (2022-11-18)
- Fix: when saving a pipeline, include dependencies in dill serialization.
## 0.1.14 (2022-05-13)
- Bugfix in .assign: shouldn't have columns
- Bugfix in imputer: explicit `missing_values` arg leads to issues
- Used space-titanic data in examples
- Logo added :)
## 0.1.13 (2022-04-08)
- Bugfix in imputer: using `strategy='constant'` threw a TypeError when used on string columns
## 0.1.12 (2022-02-07)
- Gradio & dependencies are not installed by default, but are declared an optional extra in setup
## 0.1.11 (2022-01-13)
- Example added for hyperparameter tuning with Hyperopt
## 0.1.10 (2021-12-28)
- Added support for PCA (including example)
- Gradio app support extended to regression
- Minor cleanup and improvements
## 0.1.9 (2021-12-24)
- Added support for automatic creation of Gradio app for model inspection
- Added example with Gradio app
## 0.1.8 (2021-12-23)
- Removed print statement in SkippaSimpleImputer
- Added unit tests
## 0.1.7 (2021-12-20)
- Fixed issue where GridSearchCV (or hyperparameter search in general) did not work on a Skippa pipeline
- Example added using GridSearch
## 0.1.6 (2021-12-17)
- Docs, setup, readme updates
- Updated `.apply()` method so that it accepts a columns specifier
## 0.1.5 (2021-12-13)
- Fixes for readthedocs
## 0.1.4 (2021-12-13)
- Cleanup/fix in examples/full-pipeline.py
## 0.1.3 (2021-12-10)
- Added `.apply()` transformer for `pandas.DataFrame.apply()` functionality
- Documentation and examples update
## 0.1.2 (2021-11-28)
- Added `.assign()` transformer for `pandas.DataFrame.assign()` functionality
- Added `.cast()` transformer (with aliases `.astype()` & `.as_type()`) for `pandas.DataFrame.astype` functionality
## 0.1.1 (2021-11-22)
- Fixes and documentation.
## 0.1.0 (2021-11-19)
- First release on PyPI.
Raw data
{
"_id": null,
"home_page": "https://github.com/data-science-lab-amsterdam/skippa",
"name": "skippa",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "preprocessing pipeline pandas sklearn",
"author": "Robert van Straalen",
"author_email": "tech@datasciencelab.nl",
"download_url": "https://files.pythonhosted.org/packages/2d/8a/37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610/skippa-0.1.16.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "SciKIt-learn Pre-processing Pipeline in PAndas",
"version": "0.1.16",
"project_urls": {
"Documentation": "https://skippa.readthedocs.io/",
"Homepage": "https://github.com/data-science-lab-amsterdam/skippa"
},
"split_keywords": [
"preprocessing",
"pipeline",
"pandas",
"sklearn"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a174159f75c28883048457de5b2d63ac98f73798c28c3f0687a72189a428e9fd",
"md5": "9f0f5c30ce142f4b86393cde83a540b2",
"sha256": "4865653843d9ec686a071fcff591e54b369383be6e1037b0e8b96bf7da2f5b5d"
},
"downloads": -1,
"filename": "skippa-0.1.16-py3-none-any.whl",
"has_sig": false,
"md5_digest": "9f0f5c30ce142f4b86393cde83a540b2",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 20620,
"upload_time": "2023-08-18T07:45:07",
"upload_time_iso_8601": "2023-08-18T07:45:07.397557Z",
"url": "https://files.pythonhosted.org/packages/a1/74/159f75c28883048457de5b2d63ac98f73798c28c3f0687a72189a428e9fd/skippa-0.1.16-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2d8a37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610",
"md5": "e7348e4a17aed95ac9ec3c677a7e2e15",
"sha256": "fdf9f99f1d8dcb2d7d8c21831079bf20388c2ffcc5cf7ae19b019f83d56bc878"
},
"downloads": -1,
"filename": "skippa-0.1.16.tar.gz",
"has_sig": false,
"md5_digest": "e7348e4a17aed95ac9ec3c677a7e2e15",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 27174,
"upload_time": "2023-08-18T07:45:08",
"upload_time_iso_8601": "2023-08-18T07:45:08.889889Z",
"url": "https://files.pythonhosted.org/packages/2d/8a/37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610/skippa-0.1.16.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-08-18 07:45:08",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "data-science-lab-amsterdam",
"github_project": "skippa",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"requirements": [
{
"name": "pandas",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.0.0"
]
]
},
{
"name": "dill",
"specs": [
[
">=",
"0.3.4"
]
]
},
{
"name": "gradio",
"specs": [
[
">=",
"2.5.3"
]
]
}
],
"tox": true,
"lcname": "skippa"
}