:Name: distogram
:Version: 3.0.0
:Home page: https://github.com/maki-nage/distogram.git
:Summary: A library to compute histograms on distributed environments, on streaming data
:Upload time: 2022-02-05 22:00:41
:Author: Romain Picard
:License: MIT
            
==========
DistoGram
==========


.. image:: https://badge.fury.io/py/distogram.svg
    :target: https://badge.fury.io/py/distogram

.. image:: https://github.com/maki-nage/distogram/workflows/Python%20package/badge.svg
    :target: https://github.com/maki-nage/distogram/actions?query=workflow%3A%22Python+package%22
    :alt: Github WorkFlows

.. image:: https://img.shields.io/codecov/c/github/maki-nage/distogram?style=plastic&color=brightgreen&logo=codecov&style=for-the-badge
    :target: https://codecov.io/gh/maki-nage/distogram
    :alt: Coverage

.. image:: https://readthedocs.org/projects/distogram/badge/?version=latest
    :target: https://distogram.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

.. image:: https://mybinder.org/badge_logo.svg
    :target: https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb


DistoGram is a library that computes histograms on streaming data, in
distributed environments. The implementation follows the algorithms described in
Ben-Haim and Tom-Tov's `A Streaming Parallel Decision Tree Algorithm
<http://jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf>`__.

Get Started
============

First create a compressed representation of a distribution:

.. code:: python

    import numpy as np
    import distogram

    distribution = np.random.normal(size=10000)

    # Create a distogram and feed it the samples one at a time.
    # In real usage, the data would come from an event stream.
    h = distogram.Distogram()
    for i in distribution:
        h = distogram.update(h, i)

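
To give an intuition for what "compressed representation" means, here is a
minimal sketch of the streaming histogram idea from the paper cited above:
keep a bounded, sorted list of ``(centroid, count)`` bins, and whenever the
list grows too large, merge the two closest neighbours into their weighted
mean. Combining two such lists is what makes the approach usable in a
distributed setting. This is a simplified illustration, not DistoGram's
actual code: the names ``sketch_update``, ``sketch_merge``, ``_shrink`` and
the ``max_bins`` parameter are hypothetical.

.. code:: python

    import bisect
    import random
    from typing import List, Tuple

    Bin = Tuple[float, int]  # (centroid value, number of samples)


    def _shrink(bins: List[Bin], max_bins: int) -> List[Bin]:
        """Merge the two closest adjacent bins until at most max_bins remain."""
        bins = list(bins)
        while len(bins) > max_bins:
            # Find the pair of neighbours separated by the smallest gap.
            i = min(range(len(bins) - 1),
                    key=lambda k: bins[k + 1][0] - bins[k][0])
            (v1, c1), (v2, c2) = bins[i], bins[i + 1]
            # Replace the pair with one bin at their weighted mean.
            bins[i:i + 2] = [((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)]
        return bins


    def sketch_update(bins: List[Bin], value: float, max_bins: int = 100) -> List[Bin]:
        """Return a new sorted bin list with one more sample added."""
        bins = list(bins)
        i = bisect.bisect_left(bins, (value, 0))
        if i < len(bins) and bins[i][0] == value:
            bins[i] = (value, bins[i][1] + 1)  # exact match: bump the count
        else:
            bins.insert(i, (value, 1))         # otherwise open a new bin
        return _shrink(bins, max_bins)


    def sketch_merge(a: List[Bin], b: List[Bin], max_bins: int = 100) -> List[Bin]:
        """Combine two bin lists, e.g. ones built on different workers."""
        combined = {}
        for v, c in list(a) + list(b):
            combined[v] = combined.get(v, 0) + c
        return _shrink(sorted(combined.items()), max_bins)


    # Two "workers" each summarise their own stream, then the results are merged.
    random.seed(0)
    left, right = [], []
    for _ in range(1000):
        left = sketch_update(left, random.gauss(0, 1), max_bins=20)
        right = sketch_update(right, random.gauss(0, 1), max_bins=20)
    both = sketch_merge(left, right, max_bins=20)
    print(len(both), sum(c for _, c in both))

The memory used stays bounded by ``max_bins`` regardless of how many samples
are seen, and the total count is preserved exactly; only the position of
individual samples inside a bin is lost.
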

Compute statistics on the distribution:

.. code:: python

    nmin, nmax = distogram.bounds(h)
    print("count: {}".format(distogram.count(h)))
    print("mean: {}".format(distogram.mean(h)))
    print("stddev: {}".format(distogram.stddev(h)))
    print("min: {}".format(nmin))
    print("5%: {}".format(distogram.quantile(h, 0.05)))
    print("25%: {}".format(distogram.quantile(h, 0.25)))
    print("50%: {}".format(distogram.quantile(h, 0.50)))
    print("75%: {}".format(distogram.quantile(h, 0.75)))
    print("95%: {}".format(distogram.quantile(h, 0.95)))
    print("max: {}".format(nmax))


.. code:: console

    count: 10000
    mean: -0.005082954640481095
    stddev: 1.0028524290149186
    min: -3.5691130319855047
    5%: -1.6597242392338374
    25%: -0.6785107421744653
    50%: -0.008672960012168916
    75%: 0.6720718926935414
    95%: 1.6476822301131866
    max: 3.8800560034877427

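
Statistics like the ones above can be estimated from nothing more than a
sorted list of ``(centroid, count)`` bins. The sketch below shows one simple
way to do it, assuming a non-empty bin list: the mean is a weighted average,
and a quantile is found by walking the cumulative counts and interpolating
linearly between neighbouring centroids. The helper names are hypothetical,
and the interpolation is a simplified stand-in; it is neither the procedure
from the paper nor DistoGram's own implementation.

.. code:: python

    from typing import List, Tuple

    Bin = Tuple[float, int]  # (centroid value, number of samples)


    def bins_count(bins: List[Bin]) -> int:
        """Total number of samples summarised by the bins."""
        return sum(c for _, c in bins)


    def bins_mean(bins: List[Bin]) -> float:
        """Weighted average of the centroids."""
        return sum(v * c for v, c in bins) / bins_count(bins)


    def bins_quantile(bins: List[Bin], q: float) -> float:
        """Estimate the q-quantile (0 <= q <= 1) of the summarised data.

        Each centroid is placed at the middle of its bin's share of the
        cumulative count; values in between are linearly interpolated.
        """
        if not 0 <= q <= 1:
            raise ValueError("q must be between 0 and 1")
        target = q * bins_count(bins)
        seen = 0.0
        prev_pos = prev_value = None
        for value, c in bins:
            pos = seen + c / 2  # cumulative position of this centroid
            if target <= pos:
                if prev_pos is None:
                    return value  # below the first centroid: clamp
                t = (target - prev_pos) / (pos - prev_pos)
                return prev_value + t * (value - prev_value)
            prev_pos, prev_value = pos, value
            seen += c
        return bins[-1][0]  # above the last centroid: clamp


    example = [(1.0, 2), (2.0, 4), (4.0, 2)]
    print(bins_count(example), bins_mean(example), bins_quantile(example, 0.5))
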
Compute and display the histogram of the distribution:

.. code:: python

    import pandas as pd
    import plotly.express as px

    hist = distogram.histogram(h)
    df_hist = pd.DataFrame(np.array(hist), columns=["bin", "count"])
    fig = px.bar(df_hist, x="bin", y="count", title="distogram")
    fig.update_layout(height=300)
    fig.show()

.. image:: docs/normal_histogram.png
  :scale: 60%
  :align: center

Install
========

DistoGram is available on PyPI and can be installed with pip:

.. code:: console

    pip install distogram


Play With Me
============

You can try this library directly in this
`live notebook <https://mybinder.org/v2/gh/maki-nage/distogram/master?urlpath=notebooks%2Fexamples%2Fdistogram.ipynb>`__.


Performance
============

Distogram is designed for fast updates when using Python native types. The
following numbers show the results of the benchmark program located in the
examples.

On an Intel i7-9800X CPU, the results are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
pypy 7.3      update      no          6563311
pypy 7.3      update      yes          111318
CPython 3.7   update      no           436709
CPython 3.7   update      yes          251603
============  ==========  =======  ==========

On a modest 2014 13-inch MacBook Pro, the results are:

============  ==========  =======  ==========
Interpreter   Operation   Numpy         Req/s
============  ==========  =======  ==========
pypy 7.3      update      no          3572436
pypy 7.3      update      yes           37630
CPython 3.7   update      no           112749
CPython 3.7   update      yes           81005
============  ==========  =======  ==========

As you can see, you are encouraged to use PyPy with Python native types. PyPy's
JIT is penalised by NumPy native types, causing a huge performance hit. Moreover,
the streaming philosophy of Distogram is better suited to Python native types,
while NumPy is optimized for batch computations, even with CPython.

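
For readers who want to reproduce this kind of measurement on their own
update function, the sketch below shows one way to compute an "updates per
second" figure with the standard library. It is not the benchmark program
shipped in the examples: ``updates_per_second`` and the stand-in
``running_sum`` update function are hypothetical, and the samples are
deliberately plain Python floats.

.. code:: python

    import random
    import time
    from typing import Callable, Sequence, Tuple


    def updates_per_second(update: Callable, state, values: Sequence[float]) -> float:
        """Feed every value through update(state, value) and return the rate."""
        start = time.perf_counter()
        for v in values:
            state = update(state, v)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very coarse clocks.
        return len(values) / max(elapsed, 1e-12)


    def running_sum(state: Tuple[int, float], value: float) -> Tuple[int, float]:
        """Stand-in update function: tracks the sample count and the sum."""
        n, total = state
        return n + 1, total + value


    random.seed(0)
    samples = [random.gauss(0, 1) for _ in range(100_000)]  # native floats
    rate = updates_per_second(running_sum, (0, 0.0), samples)
    print("updates/s: {:.0f}".format(rate))

If the data starts out as a NumPy array, ``array.tolist()`` yields native
Python floats, which is the kind of input the paragraph above recommends.
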

Credits
========

Although this code has been written by following the aforementioned research
paper, some parts are also inspired by the implementation from
`Carson Farmer <https://github.com/carsonfarmer/streamhist>`__.

Thanks to `John Belmonte <https://github.com/belm0>`_ for his help with
performance and accuracy improvements.



            
