==========
DistoGram
==========

.. image:: https://badge.fury.io/py/distogram.svg
    :target: https://badge.fury.io/py/distogram

.. image:: https://github.com/maki-nage/distogram/workflows/Python%20package/badge.svg
    :target: https://github.com/maki-nage/distogram/actions?query=workflow%3A%22Python+package%22
    :alt: Github WorkFlows

.. image:: https://img.shields.io/codecov/c/github/maki-nage/distogram?style=plastic&color=brightgreen&logo=codecov&style=for-the-badge
    :target: https://codecov.io/gh/maki-nage/distogram
    :alt: Coverage

.. image:: https://readthedocs.org/projects/distogram/badge/?version=latest
    :target: https://distogram.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://mybinder.org/badge_logo.svg
    :target: https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb


DistoGram is a library for computing histograms on streaming data in
distributed environments. The implementation follows the algorithm described in
Ben-Haim and Tom-Tov's `A Streaming Parallel Decision Tree Algorithm
<http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf>`__.

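
The idea behind this kind of streaming histogram can be illustrated in plain
Python. The sketch below is a simplified illustration, not DistoGram's actual
implementation: ``Bin``, ``_shrink``, ``update``, ``merge`` and ``max_bins``
are hypothetical names defined locally for this example. A histogram is kept
as a bounded list of ``(centroid, count)`` bins. Each new value becomes a bin
of its own, and the two closest bins are merged whenever the limit is
exceeded. Merging two histograms built on different workers applies the same
rule to the union of their bins.

.. code:: python

    from typing import List, Tuple

    Bin = Tuple[float, int]  # (centroid, count); hypothetical type alias


    def _shrink(bins: List[Bin], max_bins: int) -> List[Bin]:
        """Merge the two closest adjacent bins until at most max_bins remain."""
        bins = sorted(bins)
        while len(bins) > max_bins:
            # Index of the adjacent pair with the smallest centroid gap.
            i = min(range(len(bins) - 1),
                    key=lambda k: bins[k + 1][0] - bins[k][0])
            (p1, c1), (p2, c2) = bins[i], bins[i + 1]
            # Replace the pair with a single count-weighted centroid.
            bins[i:i + 2] = [((p1 * c1 + p2 * c2) / (c1 + c2), c1 + c2)]
        return bins


    def update(bins: List[Bin], value: float, max_bins: int = 100) -> List[Bin]:
        """Add one observation as a new unit bin, then shrink."""
        return _shrink(bins + [(value, 1)], max_bins)


    def merge(a: List[Bin], b: List[Bin], max_bins: int = 100) -> List[Bin]:
        """Combine two sketches, for example built on different workers."""
        return _shrink(a + b, max_bins)


    # Two workers each summarize their own part of the stream.
    worker_a: List[Bin] = []
    for x in [1.0, 2.0, 2.1, 10.0]:
        worker_a = update(worker_a, x, max_bins=3)

    worker_b: List[Bin] = []
    for x in [9.0, 11.0]:
        worker_b = update(worker_b, x, max_bins=3)

    # The summaries are combined without revisiting the raw data.
    combined = merge(worker_a, worker_b, max_bins=3)

The total count is preserved by every merge, while the number of bins stays
bounded, so memory use does not grow with the length of the stream.
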
Get Started
============
First create a compressed representation of a distribution:

.. code:: python

    import numpy as np
    import distogram

    distribution = np.random.normal(size=10000)

    # Create a distogram and feed it with the distribution.
    # In real usage, the data would come from an event stream.
    h = distogram.Distogram()
    for i in distribution:
        h = distogram.update(h, i)


Compute statistics on the distribution:

.. code:: python

    nmin, nmax = distogram.bounds(h)
    print("count: {}".format(distogram.count(h)))
    print("mean: {}".format(distogram.mean(h)))
    print("stddev: {}".format(distogram.stddev(h)))
    print("min: {}".format(nmin))
    print("5%: {}".format(distogram.quantile(h, 0.05)))
    print("25%: {}".format(distogram.quantile(h, 0.25)))
    print("50%: {}".format(distogram.quantile(h, 0.50)))
    print("75%: {}".format(distogram.quantile(h, 0.75)))
    print("95%: {}".format(distogram.quantile(h, 0.95)))
    print("max: {}".format(nmax))


.. code:: console

    count: 10000
    mean: -0.005082954640481095
    stddev: 1.0028524290149186
    min: -3.5691130319855047
    5%: -1.6597242392338374
    25%: -0.6785107421744653
    50%: -0.008672960012168916
    75%: 0.6720718926935414
    95%: 1.6476822301131866
    max: 3.8800560034877427

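
To give an intuition for where such statistics can come from, the following
sketch derives a count, a mean and a standard deviation from a list of
``(centroid, count)`` bins by treating each bin as a point mass. This is an
assumption made for illustration: ``summarize`` is a hypothetical helper, not
part of the DistoGram API, and the library's own computation may differ.

.. code:: python

    import math
    from typing import List, Tuple


    def summarize(bins: List[Tuple[float, int]]) -> Tuple[int, float, float]:
        """Return (count, mean, stddev), treating each bin as a point mass."""
        total = sum(c for _, c in bins)
        mean = sum(p * c for p, c in bins) / total
        # Population variance of the weighted centroids.
        variance = sum(c * (p - mean) ** 2 for p, c in bins) / total
        return total, mean, math.sqrt(variance)


    count, mean, stddev = summarize([(1.0, 2), (3.0, 2)])

Because only the bins are needed, these values can be refreshed at any point
of the stream without storing the individual observations.
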

Compute and display the histogram of the distribution:

.. code:: python

    import pandas as pd
    import plotly.express as px

    # Build a DataFrame from the histogram and plot it as a bar chart
    hist = distogram.histogram(h)
    df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
    fig = px.bar(df_hist, x="bin", y="count", title="distogram")
    fig.update_layout(height=300)
    fig.show()


.. image:: docs/normal_histogram.png
    :scale: 60%
    :align: center

Install
========
DistoGram is available on PyPI and can be installed with pip:

.. code:: console

    pip install distogram

Play With Me
============
You can try this library directly in this
`live notebook <https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb>`__.


Performance
===========

DistoGram is designed for fast updates when using native Python types. The
following numbers show the results of the benchmark program located in the
examples.


On an Intel i7-9800X CPU, the results are:

============  ==========  =======  ==========
Interpreter   Operation   NumPy    Req/s
============  ==========  =======  ==========
PyPy 7.3      update      no       6563311
PyPy 7.3      update      yes      111318
CPython 3.7   update      no       436709
CPython 3.7   update      yes      251603
============  ==========  =======  ==========


On a modest 2014 13" MacBook Pro, the results are:

============  ==========  =======  ==========
Interpreter   Operation   NumPy    Req/s
============  ==========  =======  ==========
PyPy 7.3      update      no       3572436
PyPy 7.3      update      yes      37630
CPython 3.7   update      no       112749
CPython 3.7   update      yes      81005
============  ==========  =======  ==========


As these numbers show, you are encouraged to use PyPy with native Python types.
PyPy's JIT is penalized by NumPy native types, which causes a large performance
hit: about 59 times fewer updates per second on the i7-9800X. Moreover, the
streaming philosophy of DistoGram is better suited to native Python types,
whereas NumPy is optimized for batch computations. Native types are faster even
with CPython.

Credits
========
Although this code has been written by following the aforementioned research
paper, some parts are also inspired by the implementation from
`Carson Farmer <https://github.com/carsonfarmer/streamhist>`__.

Thanks to `John Belmonte <https://github.com/belm0>`_ for his help with
performance and accuracy improvements.

Raw data
{
"_id": null,
"home_page": "https://github.com/maki-nage/distogram.git",
"name": "distogram",
"maintainer": "",
"docs_url": null,
"requires_python": "",
"maintainer_email": "",
"keywords": "",
"author": "Romain Picard",
"author_email": "romain.picard@oakbits.com",
"download_url": "https://files.pythonhosted.org/packages/ae/16/b94f28021935829b8b50c7c3006e88f8ed0aa8058d96457b1bcae47d7923/distogram-3.0.0.tar.gz",
"platform": "any",
"bugtrack_url": null,
"license": "MIT",
"summary": "A library to compute histograms on distributed environments, on streaming data",
"version": "3.0.0",
"project_urls": {
"Documentation": "https://distogram.readthedocs.io",
"Homepage": "https://github.com/maki-nage/distogram.git"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "ae16b94f28021935829b8b50c7c3006e88f8ed0aa8058d96457b1bcae47d7923",
"md5": "e0646b0d2c35ed6d46ab7a0d62ce304b",
"sha256": "73b7381a2a4ab7bd51fcd4caf5afde791dc84f6feac5bf2aaaec3d3ca8821256"
},
"downloads": -1,
"filename": "distogram-3.0.0.tar.gz",
"has_sig": false,
"md5_digest": "e0646b0d2c35ed6d46ab7a0d62ce304b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 12246,
"upload_time": "2022-02-05T22:00:41",
"upload_time_iso_8601": "2022-02-05T22:00:41.374277Z",
"url": "https://files.pythonhosted.org/packages/ae/16/b94f28021935829b8b50c7c3006e88f8ed0aa8058d96457b1bcae47d7923/distogram-3.0.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2022-02-05 22:00:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "maki-nage",
"github_project": "distogram",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [],
"lcname": "distogram"
}