# tdigest
### Efficient percentile estimation of streaming or distributed data
[![PyPI version](https://badge.fury.io/py/tdigest.svg)](https://badge.fury.io/py/tdigest)
[![Build Status](https://travis-ci.org/CamDavidsonPilon/tdigest.svg?branch=master)](https://travis-ci.org/CamDavidsonPilon/tdigest)
This is a Python implementation of Ted Dunning's [t-digest](https://github.com/tdunning/t-digest) data structure. The t-digest is designed for computing accurate estimates from streaming or distributed data: percentiles, quantiles, trimmed means, and so on. Two t-digests can be added together, making the data structure ideal for map-reduce settings, and a digest serializes to much less than 10 kB (instead of storing the entire list of data).
See a blog post about it here: [Percentile and Quantile Estimation of Big Data: The t-Digest](http://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest)
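To build some intuition for why a digest stays so small, here is a simplified, hypothetical sketch (not the library's actual implementation): a digest is essentially a short sorted list of `(mean, count)` centroids summarizing many points, and percentiles are estimated by walking the cumulative counts.

```
# Simplified sketch (NOT the library's implementation): a digest is a
# small sorted list of (mean, count) centroids summarizing many points.
centroids = [(0.1, 40), (0.3, 120), (0.5, 180), (0.7, 120), (0.9, 40)]
total = sum(count for _, count in centroids)  # 500 points held in 5 tuples

def percentile(p):
    """Estimate the p-th percentile by walking the cumulative counts."""
    target = p / 100.0 * total
    cumulative = 0.0
    for mean, count in centroids:
        cumulative += count
        if cumulative >= target:
            return mean  # nearest-centroid estimate; the real t-digest interpolates
    return centroids[-1][0]

print(percentile(50))  # -> 0.5
```

The real t-digest additionally keeps centroids near the tails small, which is what makes extreme percentiles accurate.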
### Installation
*tdigest* is compatible with both Python 2 and Python 3.
```
pip install tdigest
```
### Usage
#### Update the digest sequentially
```
from tdigest import TDigest
from numpy.random import random
digest = TDigest()
for x in range(5000):
    digest.update(random())
print(digest.percentile(15)) # about 0.15, as 0.15 is the 15th percentile of the Uniform(0,1) distribution
```
#### Update the digest in batches
```
another_digest = TDigest()
another_digest.batch_update(random(5000))
print(another_digest.percentile(15))
```
#### Sum two digests to create a new digest
```
sum_digest = digest + another_digest
sum_digest.percentile(30) # about 0.3
```
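As a rough mental model (a simplified sketch, not the library's internals), adding two digests can be thought of as concatenating their centroid lists, re-sorting by mean, and re-compressing the result:

```
# Hypothetical (mean, count) centroids from two independent streams.
left = [(0.2, 100), (0.6, 100)]
right = [(0.4, 50), (0.8, 150)]

# Adding digests ~ merging centroid lists; the real library also
# re-clusters/compresses the merged list to keep it small.
merged = sorted(left + right)
total_weight = sum(count for _, count in merged)
print(total_weight)  # -> 400
```

Because no raw data is needed for the merge, each worker in a map-reduce job can build its own digest and ship only the digest.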
#### Serializing a digest to a dict or JSON
You can use the `to_dict()` method to turn a TDigest object into a standard Python dictionary.
```
digest = TDigest()
digest.update(1)
digest.update(2)
digest.update(3)
print(digest.to_dict())
```
Or you can get only a list of Centroids with `centroids_to_list()`.
```
digest.centroids_to_list()
```
Similarly, you can restore a digest from a Python dict of values with `update_from_dict()`. Centroids are merged with any existing ones in the digest.
For example, make a fresh digest and restore values from a Python dictionary.
```
digest = TDigest()
digest.update_from_dict({'K': 25, 'delta': 0.01, 'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}]})
```
The `K` and `delta` values are optional, or you can provide only a list of centroids with `update_centroids_from_list()`.
```
digest = TDigest()
digest.update_centroids_from_list([{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}])
```
If you want to serialize with other tools like JSON, you can first convert with `to_dict()`.
```
import json

json.dumps(digest.to_dict())
```
Alternatively, write a custom encoder function to pass as the `default` parameter to the standard `json` module.
```
def encoder(digest_obj):
    return digest_obj.to_dict()
```
Then pass the encoder function as the default parameter.
```
json.dumps(digest, default=encoder)
```
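Putting the pieces together, here is a round-trip sketch using only the standard `json` module and the dict shape shown above (in practice the dict would come from `digest.to_dict()`):

```
import json

# Dict shaped like TDigest.to_dict() output, as shown above.
digest_dict = {
    'K': 25,
    'delta': 0.01,
    'centroids': [{'c': 1.0, 'm': 1.0}, {'c': 1.0, 'm': 2.0}, {'c': 1.0, 'm': 3.0}],
}

payload = json.dumps(digest_dict)   # serialize, e.g. to ship between workers
restored = json.loads(payload)      # deserialize on the receiving side
# A fresh TDigest() could then call update_from_dict(restored).
print(restored == digest_dict)  # -> True
```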
### API
`TDigest.`
- `update(x, w=1)`: update the tdigest with value `x` and weight `w`.
- `batch_update(x, w=1)`: update the tdigest with values in array `x` and weight `w`.
- `compress()`: perform a compression on the underlying data structure that will shrink the memory footprint of it, without hurting accuracy. Good to perform after adding many values.
- `percentile(p)`: return the `p`th percentile. Example: `p=50` is the median.
- `cdf(x)`: return the value of the CDF at `x`.
- `trimmed_mean(p1, p2)`: return the mean of the data, excluding values below the `p1` percentile and above the `p2` percentile.
- `to_dict()`: return a Python dictionary of the TDigest and internal Centroid values.
- `update_from_dict(dict_values)`: update from serialized dictionary values into the TDigest object.
- `centroids_to_list()`: return a Python list of the TDigest object's internal Centroid values.
- `update_centroids_from_list(list_values)`: update Centroids from a python list.
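For intuition on `trimmed_mean(p1, p2)`, here is a nearest-rank sketch of the semantics on raw data (the digest computes this from centroids without storing the data; the names here are illustrative):

```
data = sorted([5, 1, 2, 100, 3, 4, 2, 3])  # 100 is an outlier

def trimmed_mean(values, p1, p2):
    """Mean of the values between the p1 and p2 percentiles (nearest-rank)."""
    n = len(values)
    lo = int(n * p1 / 100)
    hi = int(n * p2 / 100)
    kept = values[lo:hi]
    return sum(kept) / len(kept)

print(trimmed_mean(data, 25, 90))  # -> 3.4 (the outlier is dropped)
```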