Overview
========
The HyperLogLog algorithm [1] is a space efficient method to estimate the
cardinality of extraordinarily large datasets. This module is written in C
for Python >= 3.6 and Python 2.7.x. It implements a 64 bit version of
HyperLogLog [2] using a Murmur64A hash.
Quick start
===========
Install Python development libraries. This step will depend on your OS. On
Ubuntu:
```
sudo apt install python-dev
```
Install HLL:
```
pip install HLL
```
Example usage:
```
from HLL import HyperLogLog
hll = HyperLogLog(10) # use 2^10 registers
hll.add('some data')
estimate = hll.cardinality()
print(estimate)
```
2.2 Changelog
=============
* Remove support for Python 2.7.
2.1 Changelog
=============
**Deprecation notice**: this is the last supported version for Python 2.7.x.
* Fixed bug where HyperLogLogs of unequal sizes could be merged.
* Fixed bug causing cardinality estimates to be off when repeatedly merging
sparse HyperLogLogs loaded from a pickle dump.
2.0 Changelog
=============
* Algorithm has been updated to a 64 bit version [2]. This fixes the
spike in relative error when switching from linear counting in the
original HyperLogLog algorithm.
* Hash function has been updated to the 64 bit Murmur64A function.
* More efficiently store registers using a combination of sparse and dense
representations.
* Improved method for counting the number of leading zeroes.
* Changed the return type of `cardinality()` from float to integer.
* Changed the return logic of `add()`. This method no longer always indicates
if a register was updated using its return value. This behavior is only
preserved in dense representation. In sparse representation, `add()` always
returns `False`.
* `HyperLogLog` objects pickled in 1.x and 2.x are not compatible.
* Added `get_register()`
* Added `hash()`
* Added `_get_meta()`
* Deprecated `murmur2_hash()`
* Deprecated `registers()`
* Deprecated `set_register()`
Documentation
=============
HyperLogLog objects
-------------------
`HyperLogLog` objects implement a 64 bit HyperLogLog algorithm [2]. They can
be used to estimate the cardinality of very large datasets. The estimation
accuracy is proportional to the number of registers. Using more registers
increases the accuracy and using less registers decreases the accuracy. The
number of registers is set in powers of 2 using the parameter `p` and defaults
to `p=12` or `2^12` registers.
```
>>> from hll import HyperLogLog
>>> hll = HyperLogLog() # Default to 2^12 registers
>>> hll.size()
4096
>>> hll = HyperLogLog(3) # Use 2^3 registers
>>> hll.size()
8
>>> for data in ['one', 'two', 'three', 'four',]:
... hll.add(data)
>>> hll.cardinality()
4L
```
HyperLogLogs use a Murmur64A hash. This function is fast and has a good
uniform distribution of bits which is necessary for accurate estimations. The
seed to this hash function can be set in the `HyperLogLog` constructor:
```
>>> hll = HyperLogLog(p=2, seed=123456789)
>>> hll.seed()
123456789
```
The hash function can also be called directly:
```
>>> hll.hash('something')
393810339
```
Individual registers can be printed:
```
>>> for i in range(2**4):
... print(hll.get_register(i))
0
0
3
0
4
```
`HyperLogLog` objects can be merged. This is done by taking the maximum value
of their respective registers:
```
>>> A = HyperLogLog(p=4)
>>> A.add('hello')
>>> B = HyperLogLog(p=4)
>>> B.add('world')
>>> A.merge(B)
>>> A.cardinality()
2
```
Register representation
-----------------------
Registers are stored using both sparse and dense representation. Originally
all registers are initialized to zero. However storing all these zeroes
individually is wasteful. Instead a sorted linked list [3] is used to store
only registers that have been set (e.g. have a non-zero value). When this list
reaches sufficient size the `HyperLogLog` object will switch to using dense
representation where registers are stored invidiaully using 6 bits.
Sparse representation can be disabled using the `sparse` flag:
```
>>> HyperLogLog(p=2, sparse=False)
```
The maximum list size for the sparse register list determines when the
`HyperLogLog` object switches to dense representation. This can be set
using `max_list_size`:
```
>>> HyperLogLog(p=15, max_list_size=10**6)
```
Traversing the sparse register list every time an item is added to the
`HyperLogLog` to update a register is expensive. A temporary buffer is instead
used to defer this operation. Items added to the `HyperLogLog` are first added
to the temporary buffer. When the buffer is full the items are sorted and then
any register updates occur. These updates can be done in one pass since both
the temproary buffer and sparse register list are sorted.
The buffer size can be set using `max_buffer_size`:
```
>>> HyperLogLog(p=15, max_buffer_size=10**5)
```
License
=======
This software is released under the [MIT License](LICENSE).
References
==========
[1] P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. "HyperLogLog: the analysis
of a near-optimal cardinality estimation algorithm," Conference on the
Analysis of Algorithms 2007.
[2] O. Ertl, "New Cardinality Estimation Methods for HyperLogLog Sketches,"
arXiv:1706.07290 [cs], June 2017.
[3] S. Heule, M. Nunkesser, A. Hall. "HyperLogLog in Practice: Algorithimic
Engineering of a State of the Art Cardinality Estimation Algorithm,"
Proceedings of the EDBT 2013 Conference, ACM, Genoa March 2013.
Raw data
{
"_id": null,
"home_page": "https://github.com/ascv/HyperLogLog",
"name": "HLL",
"maintainer": "Joshua Andersen",
"docs_url": null,
"requires_python": "<4,>=3.6",
"maintainer_email": null,
"keywords": "algorithm, approximate counting, big data, big data, cardinality, cardinality estimate, counting, data analysis, data processing, data science, data sketching, efficient computation, estimating cardinality, fast, frequency estimation, hyper log log, hyper loglog, hyperloglog, large-scale data, log log, loglog, memory efficient, probability estimate, probability sketch, probablistic counting, probablistic data structures, real-time analytics, scalable, set cardinality, set operations, sketch, statistical analysis, streaming algorithms, streaming algorithms, unique count, unique element counting",
"author": "Joshua Andersen",
"author_email": "josh.h.andersen@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/f7/08/0138ce29426afacf129586df0d64c2e984877fc21bcf834e5c20fb0d15c2/HLL-2.2.0.tar.gz",
"platform": null,
"description": "Overview\n========\n\nThe HyperLogLog algorithm [1] is a space efficient method to estimate the\ncardinality of extraordinarily large datasets. This module is written in C\nfor Python >= 3.6 and Python 2.7.x. It implements a 64 bit version of\nHyperLogLog [2] using a Murmur64A hash.\n\n\nQuick start\n===========\n\nInstall Python development libraries. This step will depend on your OS. On\nUbuntu:\n```\nsudo apt install python-dev\n```\n\nInstall HLL:\n```\npip install HLL\n```\n\nExample usage:\n```\nfrom HLL import HyperLogLog\n\nhll = HyperLogLog(10) # use 2^10 registers\nhll.add('some data')\n\nestimate = hll.cardinality()\nprint(estimate)\n```\n\n2.2 Changelog\n=============\n\n* Remove support for Python 2.7.\n\n2.1 Changelog\n=============\n\n**Deprecation notice**: this is the last supported version for Python 2.7.x.\n\n* Fixed bug where HyperLogLogs of unequal sizes could be merged.\n* Fixed bug causing cardinality estimates to be off when repeatedly merging\n sparse HyperLogLogs loaded from a pickle dump.\n\n2.0 Changelog\n=============\n\n* Algorithm has been updated to a 64 bit version [2]. This fixes the\n spike in relative error when switching from linear counting in the\n original HyperLogLog algorithm.\n* Hash function has been updated to the 64 bit Murmur64A function.\n* More efficiently store registers using a combination of sparse and dense\n representations.\n* Improved method for counting the number of leading zeroes.\n* Changed the return type of `cardinality()` from float to integer.\n* Changed the return logic of `add()`. This method no longer always indicates\n if a register was updated using its return value. This behavior is only\n preserved in dense representation. In sparse representation, `add()` always\n returns `False`.\n* `HyperLogLog` objects pickled in 1.x and 2.x are not compatible.\n* Added `get_register()`\n* Added `hash()`\n* Added `_get_meta()`\n* Deprecated `murmur2_hash()`\n* Deprecated `registers()`\n* Deprecated `set_register()`\n\nDocumentation\n=============\n\nHyperLogLog objects\n-------------------\n\n`HyperLogLog` objects implement a 64 bit HyperLogLog algorithm [2]. They can\nbe used to estimate the cardinality of very large datasets. The estimation\naccuracy is proportional to the number of registers. Using more registers\nincreases the accuracy and using less registers decreases the accuracy. The\nnumber of registers is set in powers of 2 using the parameter `p` and defaults\nto `p=12` or `2^12` registers.\n```\n>>> from hll import HyperLogLog\n>>> hll = HyperLogLog() # Default to 2^12 registers\n>>> hll.size()\n4096\n>>> hll = HyperLogLog(3) # Use 2^3 registers\n>>> hll.size()\n8\n>>> for data in ['one', 'two', 'three', 'four',]:\n... hll.add(data)\n>>> hll.cardinality()\n4L\n```\n\nHyperLogLogs use a Murmur64A hash. This function is fast and has a good\nuniform distribution of bits which is necessary for accurate estimations. The\nseed to this hash function can be set in the `HyperLogLog` constructor:\n```\n>>> hll = HyperLogLog(p=2, seed=123456789)\n>>> hll.seed()\n123456789\n```\n\nThe hash function can also be called directly:\n```\n>>> hll.hash('something')\n393810339\n```\n\nIndividual registers can be printed:\n```\n>>> for i in range(2**4):\n... print(hll.get_register(i))\n0\n0\n3\n0\n4\n```\n\n`HyperLogLog` objects can be merged. This is done by taking the maximum value\nof their respective registers:\n```\n>>> A = HyperLogLog(p=4)\n>>> A.add('hello')\n>>> B = HyperLogLog(p=4)\n>>> B.add('world')\n>>> A.merge(B)\n>>> A.cardinality()\n2\n```\n\nRegister representation\n-----------------------\n\nRegisters are stored using both sparse and dense representation. Originally\nall registers are initialized to zero. However storing all these zeroes\nindividually is wasteful. Instead a sorted linked list [3] is used to store\nonly registers that have been set (e.g. have a non-zero value). When this list\nreaches sufficient size the `HyperLogLog` object will switch to using dense\nrepresentation where registers are stored invidiaully using 6 bits.\n\nSparse representation can be disabled using the `sparse` flag:\n```\n>>> HyperLogLog(p=2, sparse=False)\n```\n\nThe maximum list size for the sparse register list determines when the\n`HyperLogLog` object switches to dense representation. This can be set\nusing `max_list_size`:\n```\n>>> HyperLogLog(p=15, max_list_size=10**6)\n```\n\nTraversing the sparse register list every time an item is added to the\n`HyperLogLog` to update a register is expensive. A temporary buffer is instead\nused to defer this operation. Items added to the `HyperLogLog` are first added\nto the temporary buffer. When the buffer is full the items are sorted and then\nany register updates occur. These updates can be done in one pass since both\nthe temproary buffer and sparse register list are sorted.\n\nThe buffer size can be set using `max_buffer_size`:\n```\n>>> HyperLogLog(p=15, max_buffer_size=10**5)\n```\n\nLicense\n=======\n\nThis software is released under the [MIT License](LICENSE).\n\nReferences\n==========\n\n[1] P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. \"HyperLogLog: the analysis\n of a near-optimal cardinality estimation algorithm,\" Conference on the\n Analysis of Algorithms 2007.\n\n[2] O. Ertl, \"New Cardinality Estimation Methods for HyperLogLog Sketches,\"\n arXiv:1706.07290 [cs], June 2017.\n\n[3] S. Heule, M. Nunkesser, A. Hall. \"HyperLogLog in Practice: Algorithimic\n Engineering of a State of the Art Cardinality Estimation Algorithm,\"\n Proceedings of the EDBT 2013 Conference, ACM, Genoa March 2013.",
"bugtrack_url": null,
"license": "MIT",
"summary": "Fast HyperLogLog for Python",
"version": "2.2.0",
"project_urls": {
"Homepage": "https://github.com/ascv/HyperLogLog"
},
"split_keywords": [
"algorithm",
" approximate counting",
" big data",
" big data",
" cardinality",
" cardinality estimate",
" counting",
" data analysis",
" data processing",
" data science",
" data sketching",
" efficient computation",
" estimating cardinality",
" fast",
" frequency estimation",
" hyper log log",
" hyper loglog",
" hyperloglog",
" large-scale data",
" log log",
" loglog",
" memory efficient",
" probability estimate",
" probability sketch",
" probablistic counting",
" probablistic data structures",
" real-time analytics",
" scalable",
" set cardinality",
" set operations",
" sketch",
" statistical analysis",
" streaming algorithms",
" streaming algorithms",
" unique count",
" unique element counting"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "f7080138ce29426afacf129586df0d64c2e984877fc21bcf834e5c20fb0d15c2",
"md5": "58c86fedd054e95519e9163333d8480a",
"sha256": "036791fbf591804b82ebaf107b66a6ee1c0f6f483f8c83af84c4bccd9d30bdf8"
},
"downloads": -1,
"filename": "HLL-2.2.0.tar.gz",
"has_sig": false,
"md5_digest": "58c86fedd054e95519e9163333d8480a",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4,>=3.6",
"size": 13881,
"upload_time": "2024-08-11T00:05:41",
"upload_time_iso_8601": "2024-08-11T00:05:41.334930Z",
"url": "https://files.pythonhosted.org/packages/f7/08/0138ce29426afacf129586df0d64c2e984877fc21bcf834e5c20fb0d15c2/HLL-2.2.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-11 00:05:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ascv",
"github_project": "HyperLogLog",
"travis_ci": true,
"coveralls": false,
"github_actions": false,
"lcname": "hll"
}