skippa


| Field | Value |
|---|---|
| Name | skippa |
| Version | 0.1.16 |
| Home page | https://github.com/data-science-lab-amsterdam/skippa |
| Summary | SciKIt-learn Pre-processing Pipeline in PAndas |
| Upload time | 2023-08-18 07:45:08 |
| Docs URL | None |
| Author | Robert van Straalen |
| Requires Python | >=3.7 |
| Keywords | preprocessing, pipeline, pandas, sklearn |
| Requirements | pandas, scikit-learn, dill, gradio |

![pypi](https://img.shields.io/pypi/v/skippa)
![python versions](https://img.shields.io/pypi/pyversions/skippa)
![downloads](https://img.shields.io/pypi/dm/skippa)
![Build status](https://img.shields.io/azure-devops/build/data-science-lab/Intern/263)

<br><br>
<img src="skippa-logo-transparent.png" alt="logo" width="200"/>

# Skippa 

SciKIt-learn Pre-processing Pipeline in PAndas

> __*Read more in the [introduction blog on towardsdatascience](https://towardsdatascience.com/introducing-skippa-bab260acf6a7)*__



Want to create a machine learning model using pandas & scikit-learn? This should make your life easier.

Skippa helps you create a pre-processing and modeling pipeline that is based on scikit-learn transformers but preserves the pandas dataframe format throughout all pre-processing. This makes it much easier to define a series of subsequent transformation steps that refer to columns in your intermediate dataframe.

So basically the same idea as `sklearn-pandas`, but a different (and hopefully better) way to achieve it.

- [pypi](https://pypi.org/project/skippa/)
- [Documentation](https://skippa.readthedocs.io/)

## Installation
```
pip install skippa
```
Optionally, if you want to use the [gradio app functionality](./examples/04-gradio-app.py), install the `gradio` extra:
```
pip install "skippa[gradio]"
```

## Basic usage

Import the `Skippa` class and the `columns` helper function:
```
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

from skippa import Skippa, columns
```

Get some data:
```
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])
```

Define your pipeline:
```
pipe = (
    Skippa()
        .select(columns(['x', 'x2', 'y', 'z']))
        .cast(columns(['x', 'x2']), 'category')
        .impute(columns(dtype_include='number'), strategy='median')
        .impute(columns(dtype_include='category'), strategy='most_frequent')
        .scale(columns(dtype_include='number'), type='standard')
        .onehot(columns(['x', 'x2']))
        .model(LogisticRegression())
)
```

and use it for fitting / predicting like this:
```
pipe.fit(X=df, y=y)

predictions = pipe.predict_proba(df)
```
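
For comparison, here is a rough sketch of what the pipeline above corresponds to in plain scikit-learn. It is not Skippa's internal implementation. The names `numeric`, `categorical`, `preprocess` and `sk_pipe` are made up for this illustration. Note that the columns have to be assigned to sub-pipelines up front, and that the intermediate results are arrays rather than dataframes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as in the example above
df = pd.DataFrame({
    'q': [0, 0, 0],
    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],
    'x': ['a', 'b', 'c'],
    'x2': ['m', 'n', 'm'],
    'y': [1, 16, 1000],
    'z': [0.4, None, 8.7]
})
y = np.array([0, 0, 1])

# Numeric columns: median imputation followed by standard scaling
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: most-frequent imputation followed by one-hot encoding
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Columns not listed here ('q' and 'date') are dropped by default,
# which plays the role of the .select() step
preprocess = ColumnTransformer([
    ('num', numeric, ['y', 'z']),
    ('cat', categorical, ['x', 'x2']),
])

sk_pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression()),
])
sk_pipe.fit(df, y)
proba = sk_pipe.predict_proba(df)
print(proba.shape)
```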

If you want details on your model, use:
```
model = pipe.get_model()
print(model.coef_)
print(model.intercept_)
```

## (de)serialization
You can also save and load your model pipelines, for example for deployment.
N.B. [`dill`](https://pypi.org/project/dill/) is used for serialization and deserialization because joblib and pickle don't provide enough support.
```
pipe.save('./models/my_skippa_model_pipeline.dill')

...

my_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')
predictions = my_pipeline.predict(df_new_data)
```
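
The note above does not say which support is missing. One well-known limitation of the standard `pickle` module is that it serializes functions by reference, so anonymous functions such as lambdas cannot be pickled. Whether this is the exact reason Skippa uses `dill` is an assumption. The sketch below only demonstrates that limitation with the standard library; `can_pickle` and `add_one` are hypothetical names.

```python
import pickle


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical user-defined step: an anonymous function, the kind of
# object that might be passed to a transformation step
add_one = lambda value: value + 1

print(can_pickle(len))      # built-in function, pickled by reference
print(can_pickle(add_one))  # lambda, has no importable name
```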

See the [./examples](./examples) directory for more examples:
- [01-standard-pipeline.py](./examples/01-standard-pipeline.py)
- [02-preprocessing-only.py](./examples/02-preprocessing-only.py)
- [03a-gridsearch.py](./examples/03a-gridsearch.py)
- [03b-hyperopt.py](./examples/03b-hyperopt.py)
- [04-gradio-app.py](./examples/04-gradio-app.py)
- [05-PCA.py](./examples/05-PCA.py)

## To Do
- [x] Support pandas assign for creating new columns based on existing columns
- [x] Support cast / astype transformer
- [x] Support for .apply transformer: wrapper around `pandas.DataFrame.apply`
- [x] Check how GridSearch (or other param search) works with Skippa
- [x] Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output
- [x] Support PCA transformer
- [ ] Facilitate random seed in Skippa object that is dispatched to all downstream operations
- [ ] fit-transform does lazy evaluation, so casting to category and then selecting category columns doesn't work; each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe
- [ ] Investigate if Skippa can directly extend sklearn's Pipeline, using the `__getitem__` trick
- [ ] Use sklearn's new dataframe output setting
- [ ] Validation of pipeline steps
- [ ] Input validation in transformers
- [ ] Transformer for replacing values (pandas .replace)
- [ ] Support arbitrary transformer (if column-preserving)
- [ ] Eliminate the need to call columns explicitly


## Credits
- Skippa is powered by [Data Science Lab Amsterdam](https://www.datasciencelab.nl)
- This project structure is based on the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.


# History

## 0.1.16 (2023-08-17)
- Bugfix: missing _replace_none attribute for SimpleImputer with strategy='constant'

## 0.1.15 (2022-11-18)
- Fix: when saving a pipeline, include dependencies in dill serialization.

## 0.1.14 (2022-05-13)
- Bugfix in .assign: shouldn't have columns
- Bugfix in imputer: explicit missing_values arg leads to issues
- Used space-titanic data in examples
- Logo added :)

## 0.1.13 (2022-04-08)
- Bugfix in imputer: using strategy='constant' threw a TypeError when used on string columns

## 0.1.12 (2022-02-07)
- Gradio & dependencies are not installed by default, but are declared as an optional extra in setup

## 0.1.11 (2022-01-13)
- Example added for hyperparameter tuning with Hyperopt

## 0.1.10 (2021-12-28)
- Added support for PCA (including example)
- Gradio app support extended to regression
- Minor cleanup and improvements

## 0.1.9 (2021-12-24)
- Added support for automatic creation of Gradio app for model inspection
- Added example with Gradio app

## 0.1.8 (2021-12-23)
- Removed print statement in SkippaSimpleImputer
- Added unit tests

## 0.1.7 (2021-12-20)
- Fixed issue that GridSearchCV (or hyperparameter search in general) did not work on a Skippa pipeline
- Example added using GridSearch

## 0.1.6 (2021-12-17)
- Docs, setup, readme updates
- Updated `.apply()` method so that it accepts a columns specifier

## 0.1.5 (2021-12-13)
- Fixes for readthedocs

## 0.1.4 (2021-12-13)
- Cleanup/fix in examples/full-pipeline.py

## 0.1.3 (2021-12-10)
- Added `.apply()` transformer for `pandas.DataFrame.apply()` functionality
- Documentation and examples update

## 0.1.2 (2021-11-28)
- Added `.assign()` transformer for `pandas.DataFrame.assign()` functionality
- Added `.cast()` transformer (with aliases `.astype()` & `.as_type()`) for `pandas.DataFrame.astype` functionality

## 0.1.1 (2021-11-22)
- Fixes and documentation.

## 0.1.0 (2021-11-19)
- First release on PyPI.



            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/data-science-lab-amsterdam/skippa",
    "name": "skippa",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "",
    "keywords": "preprocessing pipeline pandas sklearn",
    "author": "Robert van Straalen",
    "author_email": "tech@datasciencelab.nl",
    "download_url": "https://files.pythonhosted.org/packages/2d/8a/37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610/skippa-0.1.16.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "SciKIt-learn Pre-processing Pipeline in PAndas",
    "version": "0.1.16",
    "project_urls": {
        "Documentation": "https://skippa.readthedocs.io/",
        "Homepage": "https://github.com/data-science-lab-amsterdam/skippa"
    },
    "split_keywords": [
        "preprocessing",
        "pipeline",
        "pandas",
        "sklearn"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a174159f75c28883048457de5b2d63ac98f73798c28c3f0687a72189a428e9fd",
                "md5": "9f0f5c30ce142f4b86393cde83a540b2",
                "sha256": "4865653843d9ec686a071fcff591e54b369383be6e1037b0e8b96bf7da2f5b5d"
            },
            "downloads": -1,
            "filename": "skippa-0.1.16-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9f0f5c30ce142f4b86393cde83a540b2",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 20620,
            "upload_time": "2023-08-18T07:45:07",
            "upload_time_iso_8601": "2023-08-18T07:45:07.397557Z",
            "url": "https://files.pythonhosted.org/packages/a1/74/159f75c28883048457de5b2d63ac98f73798c28c3f0687a72189a428e9fd/skippa-0.1.16-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "2d8a37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610",
                "md5": "e7348e4a17aed95ac9ec3c677a7e2e15",
                "sha256": "fdf9f99f1d8dcb2d7d8c21831079bf20388c2ffcc5cf7ae19b019f83d56bc878"
            },
            "downloads": -1,
            "filename": "skippa-0.1.16.tar.gz",
            "has_sig": false,
            "md5_digest": "e7348e4a17aed95ac9ec3c677a7e2e15",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 27174,
            "upload_time": "2023-08-18T07:45:08",
            "upload_time_iso_8601": "2023-08-18T07:45:08.889889Z",
            "url": "https://files.pythonhosted.org/packages/2d/8a/37f555cb954a731afb6a8d757aff53fbb75d06f650980e1db45bbd08d610/skippa-0.1.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-08-18 07:45:08",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "data-science-lab-amsterdam",
    "github_project": "skippa",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    ">=",
                    "1.0.0"
                ]
            ]
        },
        {
            "name": "dill",
            "specs": [
                [
                    ">=",
                    "0.3.4"
                ]
            ]
        },
        {
            "name": "gradio",
            "specs": [
                [
                    ">=",
                    "2.5.3"
                ]
            ]
        }
    ],
    "tox": true,
    "lcname": "skippa"
}
        