pytdigest


Namepytdigest JSON
Version 0.1.4 PyPI version JSON
download
home_pagehttps://github.com/protivinsky/pytdigest
SummaryPython package for *fast* TDigest calculation.
upload_time2023-02-03 21:57:20
maintainer
docs_urlNone
authorTomas Protivinsky
requires_python>=3.7, <4
license
keywords tdigest distribution statistics
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            |pytest-badge| |doc-badge|

..  |pytest-badge| image:: https://github.com/protivinsky/pytdigest/actions/workflows/pytest.yaml/badge.svg
    :alt: pytest

..  |doc-badge| image:: https://github.com/protivinsky/pytdigest/actions/workflows/builddoc.yaml/badge.svg
    :alt: doc
    :target: https://protivinsky.github.io/pytdigest/index.html

PyTDigest: Fast streaming calculation of approximate quantiles
==============================================================

Python package for **fast** TDigest calculation.

- C implementation and thin Python wrapper
- suitable for big data and streaming and distributed settings
- developed for smooth compatibility with Pandas and numpy

Based on previous work of Ted Dunning and Andrew Werner.

- https://github.com/tdunning/t-digest
- https://github.com/ajwerner/tdigestc

.. warning::
    The package relies on C implementation that is compiled on host machine. I tested the compilation on Linux and
    on Windows 11, but I cannot ensure the compatibility with all operating systems. Let me know if you have encountered
    any issues.

Basic example
-------------

.. code:: python

    from pytdigest import TDigest
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(12354)
    n = 100_000
    x = rng.normal(loc=0, scale=10, size=n)
    w = rng.exponential(scale=1, size=n)

    # estimation from data is simple:
    td = TDigest.compute(x, w, compression=1_000)
    # td now contains "compressed" distribution: centroids with their means and weights

    # TDigest can be used to provide mean or approximate quantiles (i.e. inverse CDF):
    td.mean
    quantiles = [0., 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.]
    td.inverse_cdf(quantiles)

    # these results are close to numpy values (note that numpy provides only unweighted quantiles)
    np.average(x, weights=w)  # mean should be exact
    np.quantile(x, quantiles)

    # TDigest can be copied
    td2 = td.copy()

    # and multiple TDigests can be added together to provide approximate quantiles for the overall dataset
    td + td2

Performance
-----------

Ted Dunning's original algorithm in Java takes about ~140 ns per addition on average. This package needs about ~200 ns
per addition when called from Python on numpy arrays, so the performance is fairly comparable with the original
implementation. All other tested TDigest implementations in Python are much slower.

.. code:: python

    import numpy as np
    from pytdigest import TDigest
    import time

    rng = np.random.Generator(np.random.PCG64(12345))

    for n in [100_000, 1_000_000, 10_000_000]:
        x = rng.normal(size=n)
        w = rng.exponential(size=n)

        start = time.time()
        td = TDigest.compute(x, w)
        end = time.time()
        print(f'PyTDigest, n = {n:,}: {end - start:.3g} seconds')

    # PyTDigest, n = 100,000: 0.0222 seconds
    # PyTDigest, n = 1,000,000: 0.21 seconds
    # PyTDigest, n = 10,000,000: 2.02 seconds

Similar packages
----------------

Several Python packages or wrappers exist for the TDigest algorithm.

tdigest
.......

The most popular on GitHub is a pure Python
`tdigest package
<https://github.com/CamDavidsonPilon/tdigest>`_. Pure Python implementation is indeed very slow – more than 100x
slower than this package:

.. code:: python

    import numpy as np
    from pytdigest import TDigest
    from tdigest import TDigest as TDigestPython
    import time

    rng = np.random.Generator(np.random.PCG64(12345))
    n = 100_000
    x = rng.normal(size=n)
    w = rng.exponential(size=n)

    start = time.time()
    td = TDigest.compute(x, w)
    end = time.time()
    print(f'PyTDigest: {end - start:.3g} seconds')
    # PyTDigest: 0.0246 seconds

    tdp = TDigestPython()
    start = time.time()
    tdp.batch_update(x)
    end = time.time()
    print(f'TDigest: {end - start:.3g} seconds')
    # TDigest: 7.26 seconds

Different weights for can be used in tdigest only with `update` method for adding a single observation.

t-digest CFFI
.............

Other package is `t-digest CFFI
<https://github.com/kpdemetriou/tdigest-cffi>`_, a thin Python wrapper over C implementation. It does not pass
batch updates into the C layer, so the iteration has to be done in python:

.. code:: python

    import numpy as np
    from tdigest import TDigest as TDigestCFFI
    import time

    rng = np.random.Generator(np.random.PCG64(12345))
    n = 100_000
    x = rng.normal(size=n)

    tdcffi = TDigestCFFI()
    start = time.time()
    for xx in x:
        tdcffi.insert(xx)
    end = time.time()
    print(f'TDigest-CFFI: {end - start:.3g} seconds')

Hence this package is still almost 20x slower than this package when used over numpy arrays. In addition, t-digest CFFI
package allows only for integer weights.

qtdigest
........

`qtdigest
<https://github.com/QunarOPS/qtdigest>`_'s own benchmarking states that 100 000 additions take about 1.7 s, so it is
again almost 100x slower than this package.

tdigestc
........

`tdigestc
<https://github.com/ajwerner/tdigestc>`_ by ajwerner is a simple C implementation with wrappers for different
languages. The Python wrapper is very basic, it is not published on PyPI and some functionality was missing
in the underlying C implementation (for instance support for batch updates based on numpy arrays), so I took this
package as the starting point and added several useful features for use as a standalone Python package.

Future plans
------------

There are several improvements that can be done in the future:

- TDigest can calculate exact variance in addition to mean.
- Alternating merging procedure (the centroids are always merged left to right in the C implementation,
  however Ted Dunning states that alternating merging improves the precision).
- Scaling function for merging centroids is hard-coded at the moment. Ted Dunning mentions several
  possible functions that can be used in merging.
- Centroids can store information about their variance - the resulting TDigest should be still
  composable and fast and it can work much better for discrete distributions.

Documentation
-------------

- https://protivinsky.github.io/pytdigest/index.html

Legal stuff
-----------

Apache License, Version 2.0,
http://www.apache.org/licenses/LICENSE-2.0

Copyright (c) 2015 Ted Dunning, All rights reserved.
     https://github.com/tdunning/t-digest
Copyright (c) 2018 Andrew Werner, All rights reserved.
     https://github.com/ajwerner/tdigestc
Copyright (c) 2022 Tomas Protivinsky, All rights reserved.
     https://github.com/protivinsky/pytdigest


            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/protivinsky/pytdigest",
    "name": "pytdigest",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.7, <4",
    "maintainer_email": "",
    "keywords": "tdigest,distribution,statistics",
    "author": "Tomas Protivinsky",
    "author_email": "tomas.protivinsky@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/fb/af/6c94278db5800bb24366955426fd23b7f90576edb14dfd01dffd3fe8fcec/pytdigest-0.1.4.tar.gz",
    "platform": null,
    "description": "|pytest-badge| |doc-badge|\n\n..  |pytest-badge| image:: https://github.com/protivinsky/pytdigest/actions/workflows/pytest.yaml/badge.svg\n    :alt: pytest\n\n..  |doc-badge| image:: https://github.com/protivinsky/pytdigest/actions/workflows/builddoc.yaml/badge.svg\n    :alt: doc\n    :target: https://protivinsky.github.io/pytdigest/index.html\n\nPyTDigest: Fast streaming calculation of approximate quantiles\n==============================================================\n\nPython package for **fast** TDigest calculation.\n\n- C implementation and thin Python wrapper\n- suitable for big data and streaming and distributed settings\n- developed for smooth compatibility with Pandas and numpy\n\nBased on previous work of Ted Dunning and Andrew Werner.\n\n- https://github.com/tdunning/t-digest\n- https://github.com/ajwerner/tdigestc\n\n.. warning::\n    The package relies on C implementation that is compiled on host machine. I tested the compilation on Linux and\n    on Windows 11, but I cannot ensure the compatibility with all operating systems. Let me know if you have encountered\n    any issues.\n\nBasic example\n-------------\n\n.. code:: python\n\n    from pytdigest import TDigest\n    import numpy as np\n    import pandas as pd\n\n    rng = np.random.default_rng(12354)\n    n = 100_000\n    x = rng.normal(loc=0, scale=10, size=n)\n    w = rng.exponential(scale=1, size=n)\n\n    # estimation from data is simple:\n    td = TDigest.compute(x, w, compression=1_000)\n    # td now contains \"compressed\" distribution: centroids with their means and weights\n\n    # TDigest can be used to provide mean or approximate quantiles (i.e. inverse CDF):\n    td.mean\n    quantiles = [0., 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.]\n    td.inverse_cdf(quantiles)\n\n    # these results are close to numpy values (note that numpy provides only unweighted quantiles)\n    np.average(x, weights=w)  # mean should be exact\n    np.quantile(x, quantiles)\n\n    # TDigest can be copied\n    td2 = td.copy()\n\n    # and multiple TDigests can be added together to provide approximate quantiles for the overall dataset\n    td + td2\n\nPerformance\n-----------\n\nTed Dunning's original algorithm in Java takes about ~140 ns per addition on average. This package needs about ~200 ns\nper addition when called from Python on numpy arrays, so the performance is fairly comparable with the original\nimplementation. All other tested TDigest implementations in Python are much slower.\n\n.. code:: python\n\n    import numpy as np\n    from pytdigest import TDigest\n    import time\n\n    rng = np.random.Generator(np.random.PCG64(12345))\n\n    for n in [100_000, 1_000_000, 10_000_000]:\n        x = rng.normal(size=n)\n        w = rng.exponential(size=n)\n\n        start = time.time()\n        td = TDigest.compute(x, w)\n        end = time.time()\n        print(f'PyTDigest, n = {n:,}: {end - start:.3g} seconds')\n\n    # PyTDigest, n = 100,000: 0.0222 seconds\n    # PyTDigest, n = 1,000,000: 0.21 seconds\n    # PyTDigest, n = 10,000,000: 2.02 seconds\n\nSimilar packages\n----------------\n\nSeveral Python packages or wrappers exist for the TDigest algorithm.\n\ntdigest\n.......\n\nThe most popular on GitHub is a pure Python\n`tdigest package\n<https://github.com/CamDavidsonPilon/tdigest>`_. Pure Python implementation is indeed very slow \u2013 more than 100x\nslower than this package:\n\n.. code:: python\n\n    import numpy as np\n    from pytdigest import TDigest\n    from tdigest import TDigest as TDigestPython\n    import time\n\n    rng = np.random.Generator(np.random.PCG64(12345))\n    n = 100_000\n    x = rng.normal(size=n)\n    w = rng.exponential(size=n)\n\n    start = time.time()\n    td = TDigest.compute(x, w)\n    end = time.time()\n    print(f'PyTDigest: {end - start:.3g} seconds')\n    # PyTDigest: 0.0246 seconds\n\n    tdp = TDigestPython()\n    start = time.time()\n    tdp.batch_update(x)\n    end = time.time()\n    print(f'TDigest: {end - start:.3g} seconds')\n    # TDigest: 7.26 seconds\n\nDifferent weights for can be used in tdigest only with `update` method for adding a single observation.\n\nt-digest CFFI\n.............\n\nOther package is `t-digest CFFI\n<https://github.com/kpdemetriou/tdigest-cffi>`_, a thin Python wrapper over C implementation. It does not pass\nbatch updates into the C layer, so the iteration has to be done in python:\n\n.. code:: python\n\n    import numpy as np\n    from tdigest import TDigest as TDigestCFFI\n    import time\n\n    rng = np.random.Generator(np.random.PCG64(12345))\n    n = 100_000\n    x = rng.normal(size=n)\n\n    tdcffi = TDigestCFFI()\n    start = time.time()\n    for xx in x:\n        tdcffi.insert(xx)\n    end = time.time()\n    print(f'TDigest-CFFI: {end - start:.3g} seconds')\n\nHence this package is still almost 20x slower than this package when used over numpy arrays. In addition, t-digest CFFI\npackage allows only for integer weights.\n\nqtdigest\n........\n\n`qtdigest\n<https://github.com/QunarOPS/qtdigest>`_'s own benchmarking states that 100 000 additions take about 1.7 s, so it is\nagain almost 100x slower than this package.\n\ntdigestc\n........\n\n`tdigestc\n<https://github.com/ajwerner/tdigestc>`_ by ajwerner is a simple C implementation with wrappers for different\nlanguages. The Python wrapper is very basic, it is not published on PyPI and some functionality was missing\nin the underlying C implementation (for instance support for batch updates based on numpy arrays), so I took this\npackage as the starting point and added several useful features for use as a standalone Python package.\n\nFuture plans\n------------\n\nThere are several improvements that can be done in the future:\n\n- TDigest can calculate exact variance in addition to mean.\n- Alternating merging procedure (the centroids are always merged left to right in the C implementation,\n  however Ted Dunning states that alternating merging improves the precision).\n- Scaling function for merging centroids is hard-coded at the moment. Ted Dunning mentions several\n  possible functions that can be used in merging.\n- Centroids can store information about their variance - the resulting TDigest should be still\n  composable and fast and it can work much better for discrete distributions.\n\nDocumentation\n-------------\n\n- https://protivinsky.github.io/pytdigest/index.html\n\nLegal stuff\n-----------\n\nApache License, Version 2.0,\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nCopyright (c) 2015 Ted Dunning, All rights reserved.\n     https://github.com/tdunning/t-digest\nCopyright (c) 2018 Andrew Werner, All rights reserved.\n     https://github.com/ajwerner/tdigestc\nCopyright (c) 2022 Tomas Protivinsky, All rights reserved.\n     https://github.com/protivinsky/pytdigest\n\n",
    "bugtrack_url": null,
    "license": "",
    "summary": "Python package for *fast* TDigest calculation.",
    "version": "0.1.4",
    "split_keywords": [
        "tdigest",
        "distribution",
        "statistics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "22557a1bd79521e99226a004665c6ef4370cbd7c4e5b17fccb2617be02eff664",
                "md5": "62e22d5a728a65da2c71edee5b60c6a1",
                "sha256": "e29e083ed4564fff0484ad8a9ba2e95d613bdef4d8c04caf05dbf455da76f59d"
            },
            "downloads": -1,
            "filename": "pytdigest-0.1.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl",
            "has_sig": false,
            "md5_digest": "62e22d5a728a65da2c71edee5b60c6a1",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.7, <4",
            "size": 28845,
            "upload_time": "2023-02-03T21:57:18",
            "upload_time_iso_8601": "2023-02-03T21:57:18.669365Z",
            "url": "https://files.pythonhosted.org/packages/22/55/7a1bd79521e99226a004665c6ef4370cbd7c4e5b17fccb2617be02eff664/pytdigest-0.1.4-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "fbaf6c94278db5800bb24366955426fd23b7f90576edb14dfd01dffd3fe8fcec",
                "md5": "b9b6dd16613bb3aa30290ee09f97e612",
                "sha256": "573dfee19bb0f26fc1f43d8a571b513f9d8f09ff1add98ce99826357e0070f0d"
            },
            "downloads": -1,
            "filename": "pytdigest-0.1.4.tar.gz",
            "has_sig": false,
            "md5_digest": "b9b6dd16613bb3aa30290ee09f97e612",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7, <4",
            "size": 18624,
            "upload_time": "2023-02-03T21:57:20",
            "upload_time_iso_8601": "2023-02-03T21:57:20.842290Z",
            "url": "https://files.pythonhosted.org/packages/fb/af/6c94278db5800bb24366955426fd23b7f90576edb14dfd01dffd3fe8fcec/pytdigest-0.1.4.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-02-03 21:57:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "protivinsky",
    "github_project": "pytdigest",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [],
    "test_requirements": [],
    "lcname": "pytdigest"
}
        
Elapsed time: 0.03682s