.. -*- mode: rst -*-

|ReadTheDocs|_ |License|_ |PyPi|_

.. |ReadTheDocs| image:: https://readthedocs.org/projects/cinnamon/badge
.. _ReadTheDocs: https://cinnamon.readthedocs.io/en/add-documentation

.. |License| image:: https://img.shields.io/badge/License-MIT-yellow
.. _License: https://github.com/zelros/cinnamon/blob/master/LICENSE.txt

.. |PyPi| image:: https://img.shields.io/pypi/v/cinnamon
.. _PyPi: https://pypi.org/project/cinnamon/

===============================
Introduction to CinnaMon
===============================

**CinnaMon** is a Python library for monitoring data drift in a
machine learning system. It provides tools to detect, explain, and correct
data drift between two datasets.

⚡️ Quickstart
===============

As a quick example, let's illustrate the use of CinnaMon on the breast cancer dataset, into which we deliberately introduce some data drift.

Set up the data and build a model
------------------------------------

.. code:: python

    >>> import pandas as pd
    >>> from sklearn import datasets
    >>> from sklearn.model_selection import train_test_split
    >>> from xgboost import XGBClassifier

    # load breast cancer data
    >>> dataset = datasets.load_breast_cancer()
    >>> X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    >>> y = dataset.target

    # split data into train and valid datasets
    >>> X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=2021)

    # introduce some data drift in valid by filtering on the 'worst symmetry' feature
    >>> y_valid = y_valid[X_valid['worst symmetry'].values > 0.3]
    >>> X_valid = X_valid.loc[X_valid['worst symmetry'].values > 0.3, :].copy()

    # fit an XGBClassifier on the training data
    >>> clf = XGBClassifier(use_label_encoder=False)
    >>> clf.fit(X=X_train, y=y_train, verbose=10)

Initialize ModelDriftExplainer and fit on train and validation data
-------------------------------------------------------------------------

.. code:: python

    >>> import cinnamon
    >>> from cinnamon.drift import ModelDriftExplainer

    # initialize a drift explainer with the built XGBClassifier and fit it on train
    # and valid data
    >>> drift_explainer = ModelDriftExplainer(model=clf)
    >>> drift_explainer.fit(X1=X_train, X2=X_valid, y1=y_train, y2=y_valid)

Detect data drift by looking at main graphs and metrics
----------------------------------------------------------

.. code:: python

    # Distribution of logit predictions
    >>> cinnamon.plot_prediction_drift(drift_explainer, bins=15)

.. image:: https://github.com/zelros/cinnamon/raw/master/docs/img/plot_prediction_drift.png
    :width: 400
    :align: center

This graph shows that, because of the data drift we introduced in the validation
data, the two distributions of predictions differ (they do not overlap well). We
can also compute the corresponding drift metrics:

.. code:: python

    # Corresponding metrics
    >>> drift_explainer.get_prediction_drift()
    [{'mean_difference': -3.643428434667366,
      'wasserstein': 3.643428434667366,
      'kolmogorov_smirnov': KstestResult(statistic=0.2913775225333014, pvalue=0.00013914094110123454)}]

Comparing the distributions of predictions for two datasets is one of the main
indicators we use to detect data drift. The two other indicators are:

- distribution of the target (see ``get_target_drift``)
- performance metrics (see ``get_performance_metrics_drift``)

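
To make the three prediction drift metrics above more concrete, here is a minimal, standard-library sketch of what a mean difference, a Wasserstein distance, and a Kolmogorov-Smirnov statistic measure on two one-dimensional samples. It is an illustration only, not CinnaMon's implementation: the function names ``ecdf`` and ``prediction_drift_metrics`` are hypothetical, and the sign convention for the mean difference (first sample minus second) is an assumption.

```python
from bisect import bisect_right
from statistics import mean


def ecdf(sorted_sample, x):
    """Fraction of values in sorted_sample that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def prediction_drift_metrics(pred1, pred2):
    """Compare two 1-D samples of predictions with three simple statistics.

    Hypothetical helper for illustration; not part of CinnaMon.
    """
    s1, s2 = sorted(pred1), sorted(pred2)
    # Every point at which one of the two empirical CDFs can jump.
    grid = sorted(set(s1) | set(s2))

    # Kolmogorov-Smirnov statistic: largest vertical gap between the two ECDFs.
    ks = max(abs(ecdf(s1, x) - ecdf(s2, x)) for x in grid)

    # Wasserstein-1 distance: area between the two ECDFs.
    wasserstein = sum(
        abs(ecdf(s1, left) - ecdf(s2, left)) * (right - left)
        for left, right in zip(grid, grid[1:])
    )

    return {
        # Assumed sign convention: first sample minus second sample.
        "mean_difference": mean(s1) - mean(s2),
        "wasserstein": wasserstein,
        "kolmogorov_smirnov_statistic": ks,
    }


# Toy predictions: the second sample is the first one shifted by +2.
metrics = prediction_drift_metrics([0.0, 1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0])
print(metrics)
```

On this toy input the Wasserstein distance equals the absolute mean difference, which happens whenever one empirical distribution lies entirely to one side of the other. In practice, ``scipy.stats.wasserstein_distance`` and ``scipy.stats.ks_2samp`` compute the last two quantities directly.
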
Explain data drift by computing the drift importances
--------------------------------------------------------

Drift importances can be thought of as the equivalent of feature importances, but for data drift.

.. code:: python

    # plot drift importances
    >>> cinnamon.plot_tree_based_drift_importances(drift_explainer, n=7)

.. image:: https://github.com/zelros/cinnamon/raw/master/docs/img/plot_drift_values.png
    :width: 400
    :align: center

Here, the feature ``worst symmetry`` is correctly identified as the one that contributes the most to the data drift.

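
As a sanity check on a ranking like this one, it can help to compare it with a much simpler univariate baseline. The sketch below is not CinnaMon's drift importance computation and ignores the model entirely: it ranks features by how far their mean moves between two datasets, measured in standard deviations of the first dataset. The helper names and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def standardized_mean_shift(values1, values2):
    """Absolute shift of the mean, in standard deviations of the first sample."""
    shift = abs(mean(values2) - mean(values1))
    scale = pstdev(values1)
    if scale == 0:
        # Constant reference feature: report no drift if the mean is
        # unchanged, and an unbounded score otherwise.
        return 0.0 if shift == 0 else float("inf")
    return shift / scale


def rank_features_by_shift(data1, data2):
    """Rank features (dict: name -> list of values) by decreasing mean shift.

    Hypothetical univariate baseline; not part of CinnaMon.
    """
    scores = {
        name: standardized_mean_shift(data1[name], data2[name])
        for name in data1
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: only "shifted_feature" differs between the two datasets.
train = {"shifted_feature": [1, 2, 3, 4, 5, 6], "stable_feature": [10, 20, 30, 40, 50, 60]}
valid = {"shifted_feature": [4, 5, 6, 7, 8, 9], "stable_feature": [10, 20, 30, 40, 50, 60]}

ranking = rank_features_by_shift(train, valid)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A baseline like this only sees shifts in the mean of each feature taken in isolation, so it says nothing about how much a given shift actually affects the model's predictions.
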
More
------

See the Notes section below to explore the full functionality of CinnaMon.

🛠 Installation
=================
CinnaMon is intended to work with **Python 3.7 or above**. Installation can be done with ``pip``:

.. code:: sh

    $ pip install cinnamon

🔗 Notes
===========

- CinnaMon `documentation <https://cinnamon.readthedocs.io/en/latest>`_
- The two main classes of CinnaMon are ``ModelDriftExplainer`` and ``AdversarialDriftExplainer``.
- CinnaMon supports both model-specific and model-agnostic methods for the computation of
  drift importances. More information `here <https://cinnamon.readthedocs.io/en/latest/model_support.html>`_.
- CinnaMon can be used with any model or ML pipeline thanks to its model-agnostic mode.
- See the notebooks in the ``examples/`` directory for an overview of all functionalities.
  Notably:

  - `Covariate shift example with IEEE data <https://github.com/zelros/cinnamon/blob/master/examples/ieee_fraud_simulated_covariate_shift_card6.ipynb>`_
  - `Concept drift example with IEEE data <https://github.com/zelros/cinnamon/blob/master/examples/ieee_fraud_simulated_concept_drift_card6.ipynb>`_

  These two notebooks also go deeper into how to correct data drift, making use of ``AdversarialDriftExplainer``.
- See also the `slide presentation <https://yohannlefaou.github.io/publications/2021-cinnamon/Detect_explain_and_correct_data_drift_in_a_machine_learning_system.pdf>`_
  and the `video presentation <https://www.youtube.com/watch?v=S3qoBBwSS1I>`_ of the CinnaMon library.

👍 Contributing
=================
Check out the `contribution <https://github.com/zelros/cinnamon/blob/master/CONTRIBUTING.md>`_ section.
📝 License
============

CinnaMon is free and open-source software licensed under the `MIT license <https://github.com/zelros/cinnamon/blob/master/LICENSE.txt>`_.

Raw data
{
"_id": null,
"home_page": "https://github.com/zelros/cinnamon",
"name": "cinnamon",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.7",
"maintainer_email": "",
"keywords": "data drift,covariate shift,concept drift,monitoring,adversarial learning,machine learning",
"author": "Yohann Le Faou",
"author_email": "lefaou.yohann@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/1f/5f/4ea6e216f0c65fb47617c3857a32301048ef9131919ab44506f05b4846da/cinnamon-0.2.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "A monitoring tool for machine learning systems that focus on data drift",
"version": "0.2.1",
"split_keywords": [
"data drift",
"covariate shift",
"concept drift",
"monitoring",
"adversarial learning",
"machine learning"
],
"urls": [
{
"comment_text": "",
"digests": {
"md5": "7a029f2d6b03adef587242277e8e5385",
"sha256": "270a06ed40f02b63b44aad9f0115afdcb2e7c3475be85cbe0fadfb82f7e1e7ed"
},
"downloads": -1,
"filename": "cinnamon-0.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7a029f2d6b03adef587242277e8e5385",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.7",
"size": 85762,
"upload_time": "2022-12-06T22:22:02",
"upload_time_iso_8601": "2022-12-06T22:22:02.307087Z",
"url": "https://files.pythonhosted.org/packages/d7/10/7150cb9b910ff00099af9ca6143d440a31c50467afe15e08d0d1246763cb/cinnamon-0.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"md5": "bc150f3c2372ef1afc4215f987839cf3",
"sha256": "ad1da6ac65c78fd737395e2e4bfdbcbd2c3847ea9d3f866f1ae37fcdc47b9e80"
},
"downloads": -1,
"filename": "cinnamon-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "bc150f3c2372ef1afc4215f987839cf3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.7",
"size": 72439,
"upload_time": "2022-12-06T22:22:04",
"upload_time_iso_8601": "2022-12-06T22:22:04.469730Z",
"url": "https://files.pythonhosted.org/packages/1f/5f/4ea6e216f0c65fb47617c3857a32301048ef9131919ab44506f05b4846da/cinnamon-0.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-12-06 22:22:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "zelros",
"github_project": "cinnamon",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "cinnamon"
}