tdigest

 - Name: tdigest
 - Version: 0.5.2.2
 - Home page: https://github.com/CamDavidsonPilon/tdigest
 - Summary: T-Digest data structure
 - Upload time: 2019-05-07 18:57:40
 - Docs URL: None
 - Author: Cam Davidson-pilon
 - License: MIT
 - Keywords: percentile, median, probabilistic data structure, quantile, distributed, qdigest, tdigest, streaming, pyspark
 - Requirements: No requirements were recorded.
 - Coveralls test coverage: No coveralls.

# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)


This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed for computing accurate estimates from streaming or distributed data. These estimates include percentiles, quantiles, and trimmed means. Two t-digests can be added together, which makes the data structure well suited to map-reduce settings, and a digest can be serialized into much less than 10kB (instead of storing the entire list of data).

See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)


### Installation
*tdigest* is compatible with both Python 2 and Python 3. 

```
pip install tdigest
```

### Usage

#### Update the digest sequentially

```
from tdigest import TDigest
from numpy.random import random

digest = TDigest()
for x in range(5000):
    digest.update(random())

print(digest.percentile(15))  # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```

#### Update the digest in batches

```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```

#### Sum two digests to create a new digest

```
sum_digest = digest + another_digest 
sum_digest.percentile(30)  # about 0.3
```
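
Because digests support `+`, the reduce step of a map-reduce job can be sketched as a fold over per-partition digests. The helper name `merge_all` is hypothetical and not part of tdigest; it assumes only that each item supports `+`, as `TDigest` does in the example above.

```python
from functools import reduce
import operator


def merge_all(partials):
    """Fold an iterable of addable objects (e.g. TDigest instances) into one.

    `merge_all` is an illustrative helper, not a tdigest function.
    """
    partials = list(partials)
    if not partials:
        raise ValueError("need at least one partial digest to merge")
    # reduce applies `+` pairwise: ((p0 + p1) + p2) + ...
    return reduce(operator.add, partials)


# Intended use with the digests built above (requires tdigest to be installed):
# combined = merge_all([digest, another_digest])
# print(combined.percentile(30))
```

Each worker would build its own digest over its share of the data, and only the small digests need to be sent to the reducer.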

#### Converting a digest to a dict or serializing it with JSON

You can use the `to_dict()` method to turn a TDigest object into a standard Python dictionary.
```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```
Or you can get only a list of Centroids with `centroids_to_list()`.
```
digest.centroids_to_list()
```

Similarly, you can restore a digest from a Python dict of digest values with `update_from_dict()`. Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a Python dictionary.
```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```

The `K` and `delta` values are optional. Alternatively, you can provide only a list of centroids with `update_centroids_from_list()`.
```
digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```

If you want to serialize with other tools like JSON, you can first convert the digest with `to_dict()`.
```
import json

json.dumps(digest.to_dict())
```

Alternatively, define a custom encoder function to pass as the `default` argument of the standard `json` module's `dumps()`.
```
def encoder(digest_obj):
    return digest_obj.to_dict()
```
Then pass the encoder function as the default parameter.
```
json.dumps(digest, default=encoder)
```
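
Putting the pieces above together, a JSON round trip can be sketched as follows. The helper names `digest_to_json` and `digest_from_json` are hypothetical and not part of tdigest; they rely only on the `to_dict()` and `update_from_dict()` methods described above.

```python
import json


def digest_to_json(digest):
    """Serialize any object exposing to_dict() (e.g. a TDigest) to a JSON string."""
    return json.dumps(digest.to_dict())


def digest_from_json(payload, digest_factory):
    """Build a fresh digest with digest_factory() and load the JSON payload into it."""
    digest = digest_factory()
    digest.update_from_dict(json.loads(payload))
    return digest


# Intended use (requires tdigest to be installed):
# from tdigest import TDigest
# payload = digest_to_json(digest)
# restored = digest_from_json(payload, TDigest)
```

Passing the class as `digest_factory` keeps the helpers independent of how the digest is constructed.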


### API 

`TDigest.`

 - `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
 - `batch_update(x, w=1)`: update the tdigest with values in array `x` and weight `w`.
 - `compress()`: compress the underlying data structure to shrink its memory footprint without hurting accuracy. Good to perform after adding many values.
 - `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
 - `cdf(x)`: return the value of the CDF at `x`.
 - `trimmed_mean(p1, p2)`: return the mean of the data set, excluding values below the `p1` percentile and above the `p2` percentile.
 - `to_dict()`: return a Python dictionary of the TDigest and internal Centroid values.
 - `update_from_dict(dict_values)`: update from serialized dictionary values into the TDigest object.
 - `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
 - `update_centroids_from_list(list_values)`: update Centroids from a Python list.
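
To make the meaning of `cdf(x)` and `trimmed_mean(p1, p2)` concrete, here are exact, pure-Python counterparts computed from the full list of values. This is only a reference sketch: the function names are hypothetical, the percentile cut-off convention is an assumption, and this is not how tdigest computes its estimates, which come from centroids and so will generally differ slightly.

```python
def exact_cdf(values, x):
    """Fraction of values less than or equal to x (empirical CDF)."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v <= x) / len(values)


def exact_trimmed_mean(values, p1, p2):
    """Mean of the values between the p1-th and p2-th percentiles.

    Assumed convention: sort the data and drop the lowest p1% and the
    highest (100 - p2)% of the points by position.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo = int(n * p1 / 100)
    hi = int(n * p2 / 100)
    kept = ordered[lo:hi]
    if not kept:
        raise ValueError("no values left after trimming")
    return sum(kept) / len(kept)


print(exact_cdf([1, 2, 3, 4], 2))                  # 0.5
print(exact_trimmed_mean(range(1, 101), 10, 90))   # 50.5
```

On small data sets these exact functions are a convenient way to sanity-check the estimates returned by a digest.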








            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/CamDavidsonPilon/tdigest",
    "name": "tdigest",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "",
    "maintainer_email": "",
    "keywords": "percentile,median,probabilistic data structure,quantile,distributed,qdigest,tdigest,streaming,pyspark",
    "author": "Cam Davidson-pilon",
    "author_email": "cam.davidson.pilon@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/dd/34/7e2f78d1ed0af7d0039ab2cff45b6bf8512234b9f178bb21713084a1f2f0/tdigest-0.5.2.2.tar.gz",
    "platform": "",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "T-Digest data structure",
    "version": "0.5.2.2",
    "split_keywords": [
        "percentile",
        "median",
        "probabilistic data structure",
        "quantile",
        "distributed",
        "qdigest",
        "tdigest",
        "streaming",
        "pyspark"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "0be092d4caf62c7e54c27380664de896",
                "sha256": "e32ff6ab62e4defdb93b816c831080d94dfa1efb68a9fa1e7976c237fa9375cb"
            },
            "downloads": -1,
            "filename": "tdigest-0.5.2.2-py2.py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "0be092d4caf62c7e54c27380664de896",
            "packagetype": "bdist_wheel",
            "python_version": "py2.py3",
            "requires_python": null,
            "size": 9445,
            "upload_time": "2019-05-07T18:57:37",
            "upload_time_iso_8601": "2019-05-07T18:57:37.493014Z",
            "url": "https://files.pythonhosted.org/packages/32/72/f420480118cbdd18eb761b9936f0a927957130659a638449575b4a4f0aa7/tdigest-0.5.2.2-py2.py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "8655b11bc115465cf53acab1be3e0b11",
                "sha256": "dd25f8d6e6be002192bba9e4b8c16491d36c10b389f50637818603d1f67c6fb2"
            },
            "downloads": -1,
            "filename": "tdigest-0.5.2.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8655b11bc115465cf53acab1be3e0b11",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 9440,
            "upload_time": "2019-05-07T18:57:38",
            "upload_time_iso_8601": "2019-05-07T18:57:38.942776Z",
            "url": "https://files.pythonhosted.org/packages/b4/94/fd3853b98f39d10206b08f2737d2ec2dc6f46a42dc7b7e05f4f0162d13ee/tdigest-0.5.2.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "07637824cb88ef904bb5dade8e7408d1",
                "sha256": "8deffc8bac024761786f43d9444e3b6c91008cd690323e051f068820a7364d0e"
            },
            "downloads": -1,
            "filename": "tdigest-0.5.2.2.tar.gz",
            "has_sig": false,
            "md5_digest": "07637824cb88ef904bb5dade8e7408d1",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 6549,
            "upload_time": "2019-05-07T18:57:40",
            "upload_time_iso_8601": "2019-05-07T18:57:40.771529Z",
            "url": "https://files.pythonhosted.org/packages/dd/34/7e2f78d1ed0af7d0039ab2cff45b6bf8512234b9f178bb21713084a1f2f0/tdigest-0.5.2.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2019-05-07 18:57:40",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "github_user": "CamDavidsonPilon",
    "github_project": "tdigest",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "tdigest"
}
        