count-distinct


Namecount-distinct JSON
Version 0.1.16 PyPI version JSON
download
home_pageNone
SummaryUse the CVM algorithm to quickly estimate the number of distinct elements in a stream
upload_time2024-07-05 13:11:52
maintainerNone
docs_urlNone
authorNone
requires_python>=3.10
licenseNone
keywords cvm count-distinct estimation
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # A Python implementation of the CVM Algorithm for Counting Distinct Elements

This library implements the algorithm described in

> Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

This Python module leverages the `cvmcount` Rust library: https://github.com/urschrei/cvmcount

## What does that mean
The count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.

In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.

# What is an Element
Any hashable Python object or primitive. Not `f32` / `f64`, however, as they don't form a total ordering due to the presence of NaN.

## Further Details
Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree
> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the [FxHash](https://docs.rs/fxhash/latest/fxhash/) algorithm: we pay the hash cost when inserting, but search in step **D4** is `O(1)`. The library may switch to a treap implementation eventually.

# Installation
`pip install count_distinct`

# Usage
```python
from count_distinct import CVM

# values for epsilon, delta, and stream size are described in the docstring.
counter = CVM(0.8, 0.1, 1000)
counter.add(2)
counter.add(5)
# keep adding elements
count = counter.calculate_final_result()
# you can keep adding elements if you wish
```

# Perf
## Memory, 10e8 random 7-digit positive integers
Allocation of integer array: 763 MiB

`count_distinct`: 768 MiB

`np.unique`: 1.7 GiB

## Note
If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "count-distinct",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": "cvm, count-distinct, estimation",
    "author": null,
    "author_email": "Stephan H\u00fcgel <urschrei@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/0d/1e/f7b8858139de46647f4a1c8775407b5e242d240eb01775c7239998e55c7f/count_distinct-0.1.16.tar.gz",
    "platform": null,
    "description": "# A Python implementation of the CVM Algorithm for Counting Distinct Elements\n\nThis library implements the algorithm described in\n\n> Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34\n\nThe accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/\n\nThis Python module leverages the `cvmcount` Rust library: https://github.com/urschrei/cvmcount\n\n## What does that mean\nThe count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.\n\nIn order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet\u2013Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.\n\n# What is an Element\nAny hashable Python object or primitive. Not `f32` / `f64`, however, as they don't form a total ordering due to the presence of NaN.\n\n## Further Details\nDon Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure \u2013 called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) \u2013 as a binary tree\n> \"that\u2019s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 \u2264 _u_ < 1.\"\n\nwhere _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the [FxHash](https://docs.rs/fxhash/latest/fxhash/) algorithm: we pay the hash cost when inserting, but search in step **D4** is `O(1)`. The library may switch to a treap implementation eventually.\n\n# Installation\n`pip install count_distinct`\n\n# Usage\n```python\nfrom count_distinct import CVM\n\n# values for epsilon, delta, and stream size are described in the docstring.\ncounter = CVM(0.8, 0.1, 1000)\ncounter.add(2)\ncounter.add(5)\n# keep adding elements\ncount = counter.calculate_final_result()\n# you can keep adding elements if you wish\n```\n\n# Perf\n## Memory, 10e8 random 7-digit positive integers\nAllocation of integer array: 763 MiB\n\n`count_distinct`: 768 MiB\n\n`np.unique`: 1.7 GiB\n\n## Note\nIf you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "Use the CVM algorithm to quickly estimate the number of distinct elements in a stream",
    "version": "0.1.16",
    "project_urls": {
        "Repository": "https://github.com/urschrei/cvmcount_py",
        "Tracker": "https://github.com/urschrei/cvmcount_py/issues"
    },
    "split_keywords": [
        "cvm",
        " count-distinct",
        " estimation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "3b75152578c29c27c3e4f6e57ce115adedb4b0603d7ebc1f00ab8870b763ed7c",
                "md5": "0a65067dcc2342d7b2f881b83ae2c985",
                "sha256": "6195123c6ea62b9e60f7f89218a140a4e5b0af1d0c893b3f7939d9189758d5e3"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl",
            "has_sig": false,
            "md5_digest": "0a65067dcc2342d7b2f881b83ae2c985",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 217943,
            "upload_time": "2024-07-05T13:11:40",
            "upload_time_iso_8601": "2024-07-05T13:11:40.513080Z",
            "url": "https://files.pythonhosted.org/packages/3b/75/152578c29c27c3e4f6e57ce115adedb4b0603d7ebc1f00ab8870b763ed7c/count_distinct-0.1.16-cp310-abi3-macosx_10_12_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0dade121477028cb9f29133599275a35bc56462a280df17c150fc4d64bb8d5a4",
                "md5": "ab534ed7022e8053904424cbfb33a730",
                "sha256": "cf0cf9218ec45b45c9f7967904bc527511719fabd6cc4ffc5f3252da785d4168"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "ab534ed7022e8053904424cbfb33a730",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 212511,
            "upload_time": "2024-07-05T13:11:42",
            "upload_time_iso_8601": "2024-07-05T13:11:42.037145Z",
            "url": "https://files.pythonhosted.org/packages/0d/ad/e121477028cb9f29133599275a35bc56462a280df17c150fc4d64bb8d5a4/count_distinct-0.1.16-cp310-abi3-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5352eb36e363a5ec3635ca2db53dccb280b8c0228ab733e86f46b2e4b3266349",
                "md5": "843bb058634d5b1142fa3104f3565491",
                "sha256": "f0f9bb4b1bed68501cd743a8f54647af85b85283bf084b6fd956d776267e0b4d"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
            "has_sig": false,
            "md5_digest": "843bb058634d5b1142fa3104f3565491",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 247448,
            "upload_time": "2024-07-05T13:11:45",
            "upload_time_iso_8601": "2024-07-05T13:11:45.263673Z",
            "url": "https://files.pythonhosted.org/packages/53/52/eb36e363a5ec3635ca2db53dccb280b8c0228ab733e86f46b2e4b3266349/count_distinct-0.1.16-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5460d0f9961cc1e7dc22a04ba7b1617b5faa8e5b7d4f9ef064df1a6d378d946b",
                "md5": "d8da289b10b2712081303e1d4306196d",
                "sha256": "fc146f1e8da087b7a47b15229943de9a6c67ceb4dd7dc50720e51a7a006015a4"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "has_sig": false,
            "md5_digest": "d8da289b10b2712081303e1d4306196d",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 242611,
            "upload_time": "2024-07-05T13:11:47",
            "upload_time_iso_8601": "2024-07-05T13:11:47.652287Z",
            "url": "https://files.pythonhosted.org/packages/54/60/d0f9961cc1e7dc22a04ba7b1617b5faa8e5b7d4f9ef064df1a6d378d946b/count_distinct-0.1.16-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "cb37ed8c173243a31eae78d89dcff8c539976778f9dad3a518aa878ffabd81f2",
                "md5": "bc547fa7103b1cebeb67d2d8d665033b",
                "sha256": "a4d1895648e7d85cf015a5cb2efae577073106f8268825a7b6cd1edb78750c34"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
            "has_sig": false,
            "md5_digest": "bc547fa7103b1cebeb67d2d8d665033b",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 248634,
            "upload_time": "2024-07-05T13:11:49",
            "upload_time_iso_8601": "2024-07-05T13:11:49.326812Z",
            "url": "https://files.pythonhosted.org/packages/cb/37/ed8c173243a31eae78d89dcff8c539976778f9dad3a518aa878ffabd81f2/count_distinct-0.1.16-cp310-abi3-manylinux_2_5_i686.manylinux1_i686.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f988b769098de15e35dc63be859c0a4efb5332a15208195f7e0d0240ad0b30dc",
                "md5": "a658b51da3ce5eb853f84aa7c70790d6",
                "sha256": "875f709d508c7e85a867ac32e7ccc397eef7f63b5351b4dcb97848ba5524b931"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-win32.whl",
            "has_sig": false,
            "md5_digest": "a658b51da3ce5eb853f84aa7c70790d6",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 128097,
            "upload_time": "2024-07-05T13:11:50",
            "upload_time_iso_8601": "2024-07-05T13:11:50.552476Z",
            "url": "https://files.pythonhosted.org/packages/f9/88/b769098de15e35dc63be859c0a4efb5332a15208195f7e0d0240ad0b30dc/count_distinct-0.1.16-cp310-abi3-win32.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "c4230ed145b71eb45e94ebbd4b9646760cdc6277ac60eb0c9935167fdf02e4ec",
                "md5": "80e1fbd5d4dd8ea4bbfc673d9cf1c42b",
                "sha256": "7dbdea76bc46c29d5ae8f0594d1b4c388ee00492eb8304f7138d76304270f4cb"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16-cp310-abi3-win_amd64.whl",
            "has_sig": false,
            "md5_digest": "80e1fbd5d4dd8ea4bbfc673d9cf1c42b",
            "packagetype": "bdist_wheel",
            "python_version": "cp310",
            "requires_python": ">=3.10",
            "size": 137447,
            "upload_time": "2024-07-05T13:11:51",
            "upload_time_iso_8601": "2024-07-05T13:11:51.674330Z",
            "url": "https://files.pythonhosted.org/packages/c4/23/0ed145b71eb45e94ebbd4b9646760cdc6277ac60eb0c9935167fdf02e4ec/count_distinct-0.1.16-cp310-abi3-win_amd64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "0d1ef7b8858139de46647f4a1c8775407b5e242d240eb01775c7239998e55c7f",
                "md5": "c4f48e06fdca6a972546b3b0f74b8147",
                "sha256": "6de705706762758271b79dc536bd717f929cb3b686036aab850a24f5a66cca85"
            },
            "downloads": -1,
            "filename": "count_distinct-0.1.16.tar.gz",
            "has_sig": false,
            "md5_digest": "c4f48e06fdca6a972546b3b0f74b8147",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 9965,
            "upload_time": "2024-07-05T13:11:52",
            "upload_time_iso_8601": "2024-07-05T13:11:52.652749Z",
            "url": "https://files.pythonhosted.org/packages/0d/1e/f7b8858139de46647f4a1c8775407b5e242d240eb01775c7239998e55c7f/count_distinct-0.1.16.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-07-05 13:11:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "urschrei",
    "github_project": "cvmcount_py",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "count-distinct"
}
        
Elapsed time: 0.26134s