Name | pure-predict JSON |
Version |
0.0.4
JSON |
| download |
home_page | |
Summary | Machine learning prediction in pure Python |
upload_time | 2020-05-25 16:48:22 |
maintainer | |
docs_url | None |
author | Ibotta Inc. |
requires_python | >=3.6 |
license | Apache 2.0 |
keywords |
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
|
coveralls test coverage |
No coveralls.
|
pure-predict: Machine learning prediction in pure Python
========================================================
|License| |Build Status| |PyPI Package| |Python Versions|
``pure-predict`` speeds up and slims down machine learning prediction applications. It is
a foundational tool for serverless inference or small batch prediction with popular machine
learning frameworks like `scikit-learn <https://scikit-learn.org/stable/>`__ and `fasttext <https://fasttext.cc/>`__.
It implements the predict methods of these frameworks in pure Python.
Primary Use Cases
-----------------
The primary use case for ``pure-predict`` is the following scenario:
#. A model is trained in an environment without strong container footprint constraints. Perhaps a long running "offline" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.
#. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a ``numpy`` array or one string of text to classify) per request or a mini-batch of records per request.
#. Preferred infrastructure for the prediction service is either serverless (`AWS Lambda <https://aws.amazon.com/lambda/>`__) or a container service where the memory footprint of the container is constrained.
#. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).
In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single record predictions per request, the ``numpy`` and ``scipy`` functionality in the prediction methods of popular machine learning frameworks work against the application in terms of latency, `underperforming pure python <https://github.com/Ibotta/pure-predict/blob/master/examples/performance_rf.py>`__ in some cases.
Check out the `blog post <https://medium.com/building-ibotta/predict-with-sklearn-20x-faster-9f2803944446>`__
for more information on the motivation and use cases of ``pure-predict``.
Package Details
---------------
It is a Python package for machine learning prediction distributed under
the `Apache 2.0 software license <https://github.com/Ibotta/sk-dist/blob/master/LICENSE>`__.
It contains multiple subpackages which mirror their open source
counterpart (``scikit-learn``, ``fasttext``, etc.). Each subpackage has utilities to
convert a fitted machine learning model into a custom object containing prediction methods
that mirror their native counterparts, but converted to pure python. Additionally, all
relevant model artifacts needed for prediction are converted to pure python.
A ``pure-predict`` model object can then be pickled and later
unpickled without any 3rd party dependencies other than ``pure-predict``.
This eliminates the need to have large dependency packages installed in order to
make predictions with fitted machine learning models using popular open source packages for
training models. These dependencies (``numpy``, ``scipy``, ``scikit-learn``, ``fasttext``, etc.)
are large in size and `not always necessary to make fast and accurate
predictions <https://github.com/Ibotta/pure-predict/blob/master/examples/performance_rf.py>`__.
Additionally, they rely on C extensions that may not be ideal for serverless applications with a python runtime.
Quick Start Example
-------------------
In a python enviornment with ``scikit-learn`` and its dependencies installed:
.. code-block:: python
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator
# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)
# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
pickle.dump(clf_pure_predict, f)
# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
In a python enviornment with only ``pure-predict`` installed:
.. code-block:: python
import pickle
# load pickled model
with open("model.pkl", "rb") as f:
clf = pickle.load(f)
# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]
Subpackages
-----------
`pure_sklearn <https://github.com/Ibotta/pure-predict/tree/master/pure_sklearn>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Prediction in pure python for a subset of ``scikit-learn`` estimators and transformers.
- **estimators**
- **linear models** - supports the majority of linear models for classification
- **trees** - decision trees, random forests, gradient boosting and xgboost
- **naive bayes** - a number of popular naive bayes classifiers
- **svm** - linear SVC
- **transformers**
- **preprocessing** - normalization and onehot/ordinal encoders
- **impute** - simple imputation
- **feature extraction** - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
- **pipeline** - pipelines and feature unions
Sparse data - supports a custom pure python sparse data object - sparse data is handled as would be expected by the relevent transformers and estimators
`pure_fasttext <https://github.com/Ibotta/pure-predict/tree/master/pure_fasttext>`__
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Prediction in pure python for ``fasttext``.
- **supervised** - predicts labels for supervised models; no support for quantized models (blocked by `this issue <https://github.com/facebookresearch/fastText/issues/984>`__)
- **unsupervised** - lookup of word or sentence embeddings given input text
Installation
------------
Dependencies
~~~~~~~~~~~~
``pure-predict`` requires:
- `Python <https://www.python.org/>`__ (>= 3.6)
Dependency Notes
~~~~~~~~~~~~~~~~
- ``pure_sklearn`` has been tested with ``scikit-learn`` versions >= 0.20 -- certain functionality may work with lower versions but are not guaranteed. Some functionality is explicitly not supported for certain ``scikit-learn`` versions and exceptions will be raised as appropriate.
- ``xgboost`` requires version >= 0.82 for support with ``pure_sklearn``.
- ``pure-predict`` is not supported with Python 2.
- ``fasttext`` versions <= 0.9.1 have been tested.
User Installation
~~~~~~~~~~~~~~~~~
The easiest way to install ``pure-predict`` is with ``pip``:
::
pip install --upgrade pure-predict
You can also download the source code:
::
git clone https://github.com/Ibotta/pure-predict.git
Testing
~~~~~~~
With ``pytest`` installed, you can run tests locally:
::
pytest pure-predict
Examples
--------
The package contains `examples <https://github.com/Ibotta/pure-predict/tree/master/examples>`__
on how to use ``pure-predict`` in practice.
Calls for Contributors
----------------------
Contributing to ``pure-predict`` is `welcomed by any contributors <https://github.com/Ibotta/pure-predict/blob/master/CONTRIBUTING.md>`__. Specific calls for contribution are as follows:
#. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
#. Adding more ``pure_sklearn`` estimators. The ``scikit-learn`` package is extensive and only partially covered by ``pure_sklearn``. `Regression <https://scikit-learn.org/stable/supervised_learning.html#supervised-learning>`__ tasks in particular missing from ``pure_sklearn``. `Clustering <https://scikit-learn.org/stable/modules/clustering.html#clustering>`__, `dimensionality reduction <https://scikit-learn.org/stable/modules/decomposition.html#decompositions>`__, `nearest neighbors <https://scikit-learn.org/stable/modules/neighbors.html>`__, `feature selection <https://scikit-learn.org/stable/modules/feature_selection.html>`__, non-linear `SVM <https://scikit-learn.org/stable/modules/svm.html>`__, and more are also omitted and would be good candidates for extending ``pure_sklearn``.
#. General efficiency. There is likely low hanging fruit for improving the efficiency of the ``numpy`` and ``scipy`` functionality that has been ported to ``pure-predict``.
#. `Threading <https://docs.python.org/3/library/threading.html>`__ could be considered to improve performance -- particularly for making predictions with multiple records.
#. A public `AWS lambda layer <https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html>`__ containing ``pure-predict``.
Background
----------
The project was started at `Ibotta
Inc. <https://medium.com/building-ibotta>`__ on the machine learning
team and open sourced in 2020. It is currently maintained by the machine
learning team at Ibotta.
Acknowledgements
~~~~~~~~~~~~~~~~
Thanks to `David Mitchell <https://github.com/dlmitchell>`__ and `Andrew Tilley <https://github.com/tilleyand>`__ for internal review before open source. Thanks to `James Foley <https://github.com/chadfoley36>`__ for logo artwork.
.. |License| image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg
:target: https://opensource.org/licenses/Apache-2.0
.. |Build Status| image:: https://travis-ci.com/Ibotta/pure-predict.png?branch=master
:target: https://travis-ci.com/Ibotta/pure-predict
.. |PyPI Package| image:: https://badge.fury.io/py/pure-predict.svg
:target: https://pypi.org/project/pure-predict/
.. |Python Versions| image:: https://img.shields.io/pypi/pyversions/pure-predict
:target: https://pypi.org/project/pure-predict/
Raw data
{
"_id": null,
"home_page": "",
"name": "pure-predict",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "",
"author": "Ibotta Inc.",
"author_email": "machine_learning@ibotta.com",
"download_url": "https://files.pythonhosted.org/packages/de/35/c138d6df5cc212cbc21ee797949bcdb46b407f627a75a91c09b41f63d69d/pure-predict-0.0.4.tar.gz",
"platform": "",
"description": "pure-predict: Machine learning prediction in pure Python\n========================================================\n\n|License| |Build Status| |PyPI Package| |Python Versions|\n\n``pure-predict`` speeds up and slims down machine learning prediction applications. It is \na foundational tool for serverless inference or small batch prediction with popular machine \nlearning frameworks like `scikit-learn <https://scikit-learn.org/stable/>`__ and `fasttext <https://fasttext.cc/>`__. \nIt implements the predict methods of these frameworks in pure Python.\n\nPrimary Use Cases\n-----------------\nThe primary use case for ``pure-predict`` is the following scenario: \n\n#. A model is trained in an environment without strong container footprint constraints. Perhaps a long running \"offline\" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.\n#. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one \"record\" (one \"row\" in a ``numpy`` array or one string of text to classify) per request or a mini-batch of records per request.\n#. Preferred infrastructure for the prediction service is either serverless (`AWS Lambda <https://aws.amazon.com/lambda/>`__) or a container service where the memory footprint of the container is constrained.\n#. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).\n\n\nIn this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single record predictions per request, the ``numpy`` and ``scipy`` functionality in the prediction methods of popular machine learning frameworks work against the application in terms of latency, `underperforming pure python <https://github.com/Ibotta/pure-predict/blob/master/examples/performance_rf.py>`__ in some cases.\n\nCheck out the `blog post <https://medium.com/building-ibotta/predict-with-sklearn-20x-faster-9f2803944446>`__ \nfor more information on the motivation and use cases of ``pure-predict``.\n\nPackage Details\n---------------\n\nIt is a Python package for machine learning prediction distributed under \nthe `Apache 2.0 software license <https://github.com/Ibotta/sk-dist/blob/master/LICENSE>`__. \nIt contains multiple subpackages which mirror their open source \ncounterpart (``scikit-learn``, ``fasttext``, etc.). Each subpackage has utilities to \nconvert a fitted machine learning model into a custom object containing prediction methods \nthat mirror their native counterparts, but converted to pure python. Additionally, all \nrelevant model artifacts needed for prediction are converted to pure python. \n\nA ``pure-predict`` model object can then be pickled and later\nunpickled without any 3rd party dependencies other than ``pure-predict``.\n\nThis eliminates the need to have large dependency packages installed in order to \nmake predictions with fitted machine learning models using popular open source packages for\ntraining models. These dependencies (``numpy``, ``scipy``, ``scikit-learn``, ``fasttext``, etc.) \nare large in size and `not always necessary to make fast and accurate\npredictions <https://github.com/Ibotta/pure-predict/blob/master/examples/performance_rf.py>`__. \nAdditionally, they rely on C extensions that may not be ideal for serverless applications with a python runtime.\n\nQuick Start Example\n-------------------\n\nIn a python enviornment with ``scikit-learn`` and its dependencies installed:\n\n.. code-block:: python\n \n import pickle\n \n from sklearn.ensemble import RandomForestClassifier\n from sklearn.datasets import load_iris\n from pure_sklearn.map import convert_estimator\n \n # fit sklearn estimator\n X, y = load_iris(return_X_y=True)\n clf = RandomForestClassifier()\n clf.fit(X, y)\n \n # convert to pure python estimator\n clf_pure_predict = convert_estimator(clf)\n with open(\"model.pkl\", \"wb\") as f: \n pickle.dump(clf_pure_predict, f) \n \n # make prediction with sklearn estimator\n y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])\n print(y_pred)\n [2]\n \nIn a python enviornment with only ``pure-predict`` installed:\n\n.. code-block:: python\n\n import pickle\n \n # load pickled model\n with open(\"model.pkl\", \"rb\") as f: \n clf = pickle.load(f) \n \n # make prediction with pure-predict object\n y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])\n print(y_pred)\n [2]\n\nSubpackages\n-----------\n\n`pure_sklearn <https://github.com/Ibotta/pure-predict/tree/master/pure_sklearn>`__\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrediction in pure python for a subset of ``scikit-learn`` estimators and transformers.\n\n- **estimators**\n - **linear models** - supports the majority of linear models for classification\n - **trees** - decision trees, random forests, gradient boosting and xgboost \n - **naive bayes** - a number of popular naive bayes classifiers\n - **svm** - linear SVC\n- **transformers**\n - **preprocessing** - normalization and onehot/ordinal encoders\n - **impute** - simple imputation \n - **feature extraction** - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization \n - **pipeline** - pipelines and feature unions\n\nSparse data - supports a custom pure python sparse data object - sparse data is handled as would be expected by the relevent transformers and estimators\n \n`pure_fasttext <https://github.com/Ibotta/pure-predict/tree/master/pure_fasttext>`__\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrediction in pure python for ``fasttext``.\n\n- **supervised** - predicts labels for supervised models; no support for quantized models (blocked by `this issue <https://github.com/facebookresearch/fastText/issues/984>`__)\n- **unsupervised** - lookup of word or sentence embeddings given input text\n\nInstallation\n------------\n\nDependencies\n~~~~~~~~~~~~\n\n``pure-predict`` requires:\n\n- `Python <https://www.python.org/>`__ (>= 3.6)\n\nDependency Notes\n~~~~~~~~~~~~~~~~\n\n- ``pure_sklearn`` has been tested with ``scikit-learn`` versions >= 0.20 -- certain functionality may work with lower versions but are not guaranteed. Some functionality is explicitly not supported for certain ``scikit-learn`` versions and exceptions will be raised as appropriate.\n- ``xgboost`` requires version >= 0.82 for support with ``pure_sklearn``.\n- ``pure-predict`` is not supported with Python 2.\n- ``fasttext`` versions <= 0.9.1 have been tested.\n\nUser Installation\n~~~~~~~~~~~~~~~~~\n\nThe easiest way to install ``pure-predict`` is with ``pip``:\n\n::\n\n pip install --upgrade pure-predict\n\nYou can also download the source code:\n\n::\n\n git clone https://github.com/Ibotta/pure-predict.git\n\nTesting\n~~~~~~~\n\nWith ``pytest`` installed, you can run tests locally:\n\n::\n\n pytest pure-predict\n\nExamples\n--------\n\nThe package contains `examples <https://github.com/Ibotta/pure-predict/tree/master/examples>`__ \non how to use ``pure-predict`` in practice.\n\nCalls for Contributors\n----------------------\n\nContributing to ``pure-predict`` is `welcomed by any contributors <https://github.com/Ibotta/pure-predict/blob/master/CONTRIBUTING.md>`__. Specific calls for contribution are as follows:\n\n#. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.\n#. Adding more ``pure_sklearn`` estimators. The ``scikit-learn`` package is extensive and only partially covered by ``pure_sklearn``. `Regression <https://scikit-learn.org/stable/supervised_learning.html#supervised-learning>`__ tasks in particular missing from ``pure_sklearn``. `Clustering <https://scikit-learn.org/stable/modules/clustering.html#clustering>`__, `dimensionality reduction <https://scikit-learn.org/stable/modules/decomposition.html#decompositions>`__, `nearest neighbors <https://scikit-learn.org/stable/modules/neighbors.html>`__, `feature selection <https://scikit-learn.org/stable/modules/feature_selection.html>`__, non-linear `SVM <https://scikit-learn.org/stable/modules/svm.html>`__, and more are also omitted and would be good candidates for extending ``pure_sklearn``.\n#. General efficiency. There is likely low hanging fruit for improving the efficiency of the ``numpy`` and ``scipy`` functionality that has been ported to ``pure-predict``.\n#. `Threading <https://docs.python.org/3/library/threading.html>`__ could be considered to improve performance -- particularly for making predictions with multiple records.\n#. A public `AWS lambda layer <https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html>`__ containing ``pure-predict``.\n\nBackground\n----------\n\nThe project was started at `Ibotta\nInc. <https://medium.com/building-ibotta>`__ on the machine learning\nteam and open sourced in 2020. It is currently maintained by the machine \nlearning team at Ibotta.\n\nAcknowledgements\n~~~~~~~~~~~~~~~~\nThanks to `David Mitchell <https://github.com/dlmitchell>`__ and `Andrew Tilley <https://github.com/tilleyand>`__ for internal review before open source. Thanks to `James Foley <https://github.com/chadfoley36>`__ for logo artwork.\n\n\n\n.. |License| image:: https://img.shields.io/badge/License-Apache%202.0-blue.svg\n :target: https://opensource.org/licenses/Apache-2.0\n.. |Build Status| image:: https://travis-ci.com/Ibotta/pure-predict.png?branch=master\n :target: https://travis-ci.com/Ibotta/pure-predict\n.. |PyPI Package| image:: https://badge.fury.io/py/pure-predict.svg\n :target: https://pypi.org/project/pure-predict/\n.. |Python Versions| image:: https://img.shields.io/pypi/pyversions/pure-predict\n :target: https://pypi.org/project/pure-predict/",
"bugtrack_url": null,
"license": "Apache 2.0",
"summary": "Machine learning prediction in pure Python",
"version": "0.0.4",
"project_urls": {
"Download": "https://pypi.org/project/pure-predict/#files",
"Source Code": "https://github.com/Ibotta/pure-predict"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "de35c138d6df5cc212cbc21ee797949bcdb46b407f627a75a91c09b41f63d69d",
"md5": "1d20d79fcccb03be33d61d9c6da7780f",
"sha256": "fb9c2cf7cbbf46e309029d7e3d82c715dec03ae5ebf67e1d16caa50a045947a3"
},
"downloads": -1,
"filename": "pure-predict-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "1d20d79fcccb03be33d61d9c6da7780f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 38802,
"upload_time": "2020-05-25T16:48:22",
"upload_time_iso_8601": "2020-05-25T16:48:22.976392Z",
"url": "https://files.pythonhosted.org/packages/de/35/c138d6df5cc212cbc21ee797949bcdb46b407f627a75a91c09b41f63d69d/pure-predict-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2020-05-25 16:48:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "Ibotta",
"github_project": "pure-predict",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"lcname": "pure-predict"
}