Name | aprxc |
Version | 1.1.3 |
home_page | None |
Summary | A command-line tool to estimate the number of distinct lines in a file/stream using Chakraborty/Vinodchandran/Meel’s approximation algorithm. |
upload_time | 2024-08-26 09:35:55 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.11 |
license | None |
keywords | algorithm, cli, computer-science, math |
requirements | No requirements were recorded. |
# `aprxc` (ApproxiCount)
A command-line tool (and Python class) to approximate the number of distinct
elements in files (or a stream) using the *F0-Estimator* algorithm by S.
Chakraborty, N. V. Vinodchandran and K. S. Meel, as described in their 2023
paper "Distinct Elements in Streams: An Algorithm for the (Text) Book"
(https://arxiv.org/pdf/2301.10191#section.2).
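For intuition, here is a minimal sketch of that F0-estimator (often called the CVM algorithm) in Python, following Algorithm 1 of the paper. Treat it as an illustration of the technique, not as `aprxc`'s actual implementation, which may differ in details:
```python
import math
import random
import sys

def cvm_estimate(stream, epsilon=0.1, delta=0.1, stream_size=sys.maxsize):
    """Sketch of the CVM F0-estimator (Algorithm 1 in the paper).

    `stream_size` is an upper bound on the stream length; with the
    64-bit default of sys.maxsize and epsilon = delta = 0.1, the
    buffer gets the ~83k slots mentioned below.
    """
    # Buffer size from the paper: ceil((12 / eps^2) * log2(8 * m / delta)).
    threshold = math.ceil(12 / epsilon**2 * math.log2(8 * stream_size / delta))
    p = 1.0         # current sampling probability
    sample = set()  # holds each distinct element seen so far with probability p
    for item in stream:
        sample.discard(item)        # re-decide this element's membership
        if random.random() < p:
            sample.add(item)
        if len(sample) == threshold:
            # Buffer full: evict each element with probability 1/2, halve p.
            # (The paper aborts in the unlikely case nothing gets evicted.)
            sample = {x for x in sample if random.random() < 0.5}
            p /= 2
    return round(len(sample) / p)   # |sample| / p estimates the distinct count
```
Run against a file object, e.g. `cvm_estimate(open('big.log'))`, this approximates the number of distinct lines while never holding more than `threshold` of them in memory.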
**Motivation (elevator pitch):** Easier to remember and always faster than `sort
| uniq -c | wc -l`. Uses a fixed amount of memory for huge datasets, unlike the
ever-growing footprint of `awk '!a[$0]++' | wc -l`. Counts accurately for the
first ~83k unique elements (on 64-bit systems), with a deviation of about 0.4–1%
after that.
## Installation
Choose your preferred way:
```shell
pip install aprxc
uv tool install aprxc
```
Or test-run it in an isolated environment first, via [pipx run](https://pipx.pypa.io/) or [uvx](https://docs.astral.sh/uv/concepts/tools/):
```shell
pipx run aprxc --help
uvx aprxc --help
```
Lastly, as `aprxc.py` has no dependencies besides Python 3.11+, you can simply
download it, run it, put it in your PATH, vendor it, etc.
## Features and shortcomings
Compared to sort/uniq:
- sort/uniq always uses less memory (about 30–50%).
- sort/uniq is about 5 times *slower*.
Compared to 'the awk construct':
- awk uses about the same amount of time (0.5×–2×).
- awk uses *much more* memory for large files: roughly linear in the file
  size, while ApproxiCount has an upper bound. For typical multi-GiB files
  this can mean factors of 20×–150×, e.g. 5 GiB (awk) vs. 40 MiB (aprxc).
Now let's address the elephant in the room: all these advantages (yes, the pros
and cons are pretty balanced, but overall `aprxc` performs better than the
alternatives, especially on large inputs) are bought with **an inaccuracy in
the reported counts**.
### About inaccuracy
But how inaccurate? In its default configuration you'll get a **mean inaccuracy
of about 0.4%**, with occasional **outliers around 1%**. For example, if the
script encounters 10M (`10_000_000`) actual unique values, the reported count is
typically ~40k off (e.g. `10_038_680`), sometimes ~100k (e.g. `9_897_071`).
Here's an overview (highly unscientific!) of how the algorithm parameters ε and
δ (`--epsilon` and `--delta` on the command line) affect the inaccuracy. The
defaults of `0.1` for both values seem to strike a good balance (and a memorable
worst-case inaccuracy of ~1%). Epsilon is the main tuning knob, and you can see
quite clearly how its value affects especially the maximum inaccuracy.
(For this overview I counted 10 million unique 32-character strings[^1], and _for
each_ iteration I checked the reported count against the actual number of unique
items. 'Mean inacc.' is the mean inaccuracy across all 10M steps; 'max inacc.'
is the largest deviation encountered; memory usage is the Linux `time` tool's
reported 'maxresident'; time usage is wall time.)
| ε | δ | set size | mean inacc. | max inacc. | memory usage | time usage |
| ---- | --- | --------:| ----------- | ----------- | ---------------:| ------------:|
| 0.01 | 0.1 | 8318632 | 0.004% | 0.034% | 1155MiB (4418%) | 12.5s (162%) |
| 0.05 | 0.1 | 332746 | 0.17% | 0.43% | 70MiB (269%) | 9.5s (123%) |
| 0.1 | 0.1 | 83187 | __0.37%__ | __0.97%__ | 26MiB (100%) | 7.7s (100%) |
| 0.2 | 0.1 | 20797 | 0.68% | 2.16% | 17MiB (65%) | 7.3s (95%) |
| 0.5 | 0.5 | 3216 | 1.75% | 5.45% | 13MiB (36%) | 8.8s (114%) |
**Important (and a nice feature):** In its default configuration, the algorithm
uses a set data structure with 83187 slots, meaning that until that number of
unique elements is encountered the reported counts are **exact**; only once this
limit is reached does the 'actual' approximation algorithm kick in, and the
numbers become estimates.
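That 83187 figure drops out of the paper's buffer-size formula, assuming the stream-length bound defaults to `sys.maxsize` on 64-bit systems. A quick sanity check (this is my reading of the formula, reproducing the 'set size' column above, not a guarantee about `aprxc`'s internals):
```python
import math
import sys

def buffer_slots(epsilon: float, delta: float, stream_size: int = sys.maxsize) -> int:
    """ceil((12 / eps^2) * log2(8 * m / delta)), from the paper."""
    return math.ceil(12 / epsilon**2 * math.log2(8 * stream_size / delta))

print(buffer_slots(0.1, 0.1))   # 83187   (the default configuration)
print(buffer_slots(0.01, 0.1))  # 8318632
print(buffer_slots(0.5, 0.5))   # 3216
```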
### Is it useful?
- You have to be okay with the inaccuracies, obviously.
- However, for small unique counts (less than 80k) the numbers are accurate and
the command might be easier to remember than the sort/uniq pipe or the awkward
awk construct.
- It's basically always faster than the sort/uniq pipe.
- If you are memory-constrained and want to deal with large files, it might be
an option.
- If you are doing exploratory work and don't care about exact numbers, or will
  round them anyway in the end, this can save you time.
### The experimental 'top most common' feature
I've added a couple of lines of code to support a 'top most common' items
feature, an alternative to the `sort | uniq -c | sort -rn | head` pipeline or
[Tim Bray's nice `topfew` tool (written in
Go)](https://github.com/timbray/topfew/).
It kinda works (a speculative sketch of one possible mechanism follows this
list), but…
- The counts are good, even surprisingly good, just as for the base algorithm,
  but definitely worse and not as reliable as the ~1% worst-case inaccuracy of
  the total-count case.
- I lack the mathematical expertise to prove or disprove anything about that
feature.
- If you ask for a top 10 (`-t10` or `--top 10`), you mostly get what you
  expect, but if the counts are close the lower ranks become 'unstable'; even
  ranks 1 and 2 sometimes switch places.
- Compared with `topfew` (I wondered if this approximation algorithm could be
_an optional_ flag for `topfew`), this Python code is impressively close to
the Go performance, especially if reading a lot of data from a pipe.
Unfortunately, I fear that this algorithm is not parallelizable. But I leave
that, and the re-implementation in Go or Rust, as an exercise for the reader
:)
- Just try it!
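As promised above, here is a speculative sketch of one way a top-X feature could sit on top of the CVM sample: keep a counter next to each sampled element and scale it by 1/p at the end. This is purely a guess at a possible mechanism, not `aprxc`'s actual code, and it exhibits the same rank instability for close counts:
```python
import random

def cvm_top(stream, threshold=83187, x=10):
    """Hypothetical top-X on a CVM-style sample (a guess, not aprxc's code)."""
    p = 1.0
    counts = {}  # sampled element -> occurrences seen while it was resident
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif random.random() < p:   # admit new elements with probability p
            counts[item] = 1
        if len(counts) == threshold:
            # Evict each element (and its count) with probability 1/2.
            counts = {k: v for k, v in counts.items() if random.random() < 0.5}
            p /= 2
    # Scale resident counts by 1/p; occurrences before admission are lost,
    # which is one reason close ranks can swap between runs.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(k, round(v / p)) for k, v in ranked[:x]]
```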
## Command-line interface
```shell
usage: aprxc [-h] [--top [X]] [--size SIZE] [--epsilon EPSILON]
[--delta DELTA] [--cheat | --no-cheat] [--verbose] [--version]
[--debug]
[path ...]
Estimate the number of distinct lines in a file or stream.
positional arguments:
path Input file path(s) and/or '-' for stdin (default:
stdin)
options:
-h, --help show this help message and exit
--top [X], -t [X] EXPERIMENTAL: Show X most common values. Off by
default. If enabled, X defaults to 10.
--size SIZE, -s SIZE Total amount of data items, if known in advance. (Can
be approximated.)
--epsilon EPSILON, -E EPSILON
--delta DELTA, -D DELTA
--cheat, --no-cheat Use 'total seen' number as upper bound for unique
count.
--verbose, -v
--version, -V show program's version number and exit
--debug
```
---
[^1]:
The benchmark script:
```shell
cat /dev/urandom | pv -q -L 1000M | base64 -w 32 | command time ./aprxc.py --debug --epsilon=0.1 --delta=0.1
```
Raw data

```json
{
"_id": null,
"home_page": null,
"name": "aprxc",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "algorithm, cli, computer-science, math",
"author": null,
"author_email": "Fabian Neumann <dev@fabianneumann.de>",
"download_url": "https://files.pythonhosted.org/packages/0c/be/3ef54e82b82b05f1fa83341a3cadeaeed040a33e26eb893952241e802715/aprxc-1.1.3.tar.gz",
"platform": null,
"description": "# `aprxc` (ApproxiCount)\n\nA command-line tool (and Python class) to approximate the number of distinct\nelements in files (or a stream) using the *F0-Estimator* algorithm by S.\nChakraborty, N. V. Vinodchandran and K. S. Meel, as described in their 2023\npaper \"Distinct Elements in Streams: An Algorithm for the (Text) Book\"\n(https://arxiv.org/pdf/2301.10191#section.2).\n\n**Motivation (elevator pitch):** Easier to remember and always faster than `sort\n| uniq -c | wc -l`. Uses a fixed amount of memory for huge datasets, unlike the\never-growing footprint of `awk '!a[$0]++' | wc -l`. Counts accurately for the\nfirst ~83k unique elements (on 64-bit systems), with a deviation of about 0.4\u20131%\nafter that.\n\n## Installation\n\nChoose your preferred way:\n\n```shell\npip install aprxc\nuv tool install aprxc\n```\n\nOr test-run it in an isolated environment first, via [pipx run](https://pipx.pypa.io/) or [uvx](https://docs.astral.sh/uv/concepts/tools/):\n\n```shell\npipx run aprxc --help\nuvx aprxc --help\n```\n\nLastly, as `aprxc.py` has no dependencies besides Python 3.11+, you can simply\ndownload it, run it, put it your PATH, vendor it, etc.\n\n## Features and shortcomings\n\nCompared to sort/uniq:\n\n- sort/uniq always uses less memory (about 30-50%).\n- sort/uniq is about 5 times *slower*.\n\nCompared to 'the awk construct':\n\n- awk uses about the same amount of time (0.5x-2x).\n- awk uses *much more* memory for large files. Basically linear to the file\n size, while ApproxiCount has an upper bound. For typical multi-GiB files\n this can mean factors of 20x-150x, e.g. 5GiB (awk) vs. 40MiB (aprxc).\n\nNow let's address the elephant in the room: All these advantages (yes, the pro\nand cons are pretty balanced, but overall one can say that `aprxc` performs\ngenerally better than the alternatives, especially with large data inputs) are\nbought with **an inaccuracy in the reported counts**.\n\n### About inaccuracy\n\nBut how inaccurate? In its default configuration you'll get a **mean inaccuracy\nof about 0,4%**, with occasional **outliers around 1%**. For example, if the\nscript encounters 10M (`10_000_000`) actual unique values, the reported count is\ntypically ~40k off (e.g. `10_038_680`), sometimes ~100k (e.g. `9_897_071`).\n\nHere's an overview (highly unscientific!) of how the algorithm parameters \ud835\udf00 and\n\ud835\udeff (`--epsilon` and `--delta` on the command line) affect the inaccuracy. The\ndefaults of `0.1` for both values seem to strike a good balance (and a memorable\ninaccuracy of ~1%). Epsilon is the 'main manipulation knob', and you can see\nquite good how its value affects especially the maximum inaccuracy.\n\n(For this overview I counted 10 million unique 32-character strings[^1], and _for\neach_ iteration I checked the reported count and compared to the actual number\nof unique items. 'Mean inacc.' is the mean inaccuracy across all 10M steps;\n'max inacc.' is the highest off encountered; memory usage is the linux tool\n`time`'s reported 'maxresident'; time usage is wall time.)\n\n| \ud835\udf00 | \ud835\udeff | set size | mean inacc. | max inacc. 
| memory usage | time usage |\n| ---- | --- | --------:| ----------- | ----------- | ---------------:| ------------:|\n| 0.01 | 0.1 | 8318632 | 0.004% | 0.034% | 1155MiB (4418%) | 12.5s (162%) |\n| 0.05 | 0.1 | 332746 | 0.17% | 0.43% | 70MiB (269%) | 9.5s (123%) |\n| 0.1 | 0.1 | 83187 | __0.37%__ | __0.97%__ | 26MiB (100%) | 7.7s (100%) |\n| 0.2 | 0.1 | 20797 | 0.68% | 2.16% | 17MiB (65%) | 7.3s (95%) |\n| 0.5 | 0.5 | 3216 | 1.75% | 5.45% | 13MiB (36%) | 8.8s (114%) |\n\n**Important (and nice feature):** In its default configuration, the algorithm\nuses a set data structure with 83187 slots, meaning that until that number of\nunique elements are encountered the reported counts are **exact**; only once\nthis limit is reached, the 'actual' approximation algorithm kicks in and numbers\nwill become estimations.\n\n### Is it useful?\n\n- You have to be okay with the inaccuracies, obviously.\n- However, for small unique counts (less than 80k) the numbers are accurate and\n the command might be easier to remember than the sort/uniq pipe or the awkward\n awk construct.\n- It's basically always faster than the sort/uniq pipe.\n- If you are memory-constrained and want to deal with large files, it might be\n an option.\n- If you are working exploratory and don't care about exact numbers or you will\n round them anyway in the end, this can save you time.\n\n### The experimental 'top most common' feature\n\nI've added a couple of lines of code to support a 'top most common' items\nfeature. An alternative to the `sort | uniq -c | sort -rn | head`-pipeline or\n[Tim Bray's nice `topfew` tool (written in\nGo)](https://github.com/timbray/topfew/).\n\nIt kinda works, but\u2026\n\n- The counts are good, even surprisingly good, as for the whole base algorithm,\n but definitely worse and not as reliable as the nice 1%-mean-inaccuracy for\n the total count case.\n- I lack the mathematical expertise to prove or disprove anything about that\n feature.\n- If you ask for a top 10 (`-t10` or `--top 10`), you mostly get what you\n expect, but if the counts are close the lower ranks become 'unstable'; even\n rank 1 and 2 sometimes switch places etc.\n- Compared with `topfew` (I wondered if this approximation algorithm could be\n _an optional_ flag for `topfew`), this Python code is impressively close to\n the Go performance, especially if reading a lot of data from a pipe.\n Unfortunately, I fear that this algorithm is not parallelizable. But I leave\n that, and the re-implementation in Go or Rust, as an exercise for the reader\n :)\n- Just try it!\n\n## Command-line interface\n\n```shell\nusage: aprxc [-h] [--top [X]] [--size SIZE] [--epsilon EPSILON]\n [--delta DELTA] [--cheat | --no-cheat] [--verbose] [--version]\n [--debug]\n [path ...]\n\nEstimate the number of distinct lines in a file or stream.\n\npositional arguments:\n path Input file path(s) and/or '-' for stdin (default:\n stdin)\n\noptions:\n -h, --help show this help message and exit\n --top [X], -t [X] EXPERIMENTAL: Show X most common values. Off by\n default. If enabled, X defaults to 10.\n --size SIZE, -s SIZE Total amount of data items, if known in advance. 
(Can\n be approximated.)\n --epsilon EPSILON, -E EPSILON\n --delta DELTA, -D DELTA\n --cheat, --no-cheat Use 'total seen' number as upper bound for unique\n count.\n --verbose, -v\n --version, -V show program's version number and exit\n --debug\n```\n\n---\n\n[^1]:\n The benchmark script:\n\n ```shell\n cat /dev/urandom | pv -q -L 1000M | base64 -w 32 | command time ./aprxc.py --debug --epsilon=0.1 --delta=0.1\n ```\n",
"bugtrack_url": null,
"license": null,
"summary": "A command-line tool to estimate the number of distinct lines in a file/stream using Chakraborty/Vinodchandran/Meel\u2019s approximation algorithm.",
"version": "1.1.3",
"project_urls": {
"Codeberg": "https://codeberg.org/fa81/aprxc",
"GitHub": "https://github.com/hellp/aprxc"
},
"split_keywords": [
"algorithm",
" cli",
" computer-science",
" math"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "0b84dd8ece79ff57749c6f3a02e494c902efdef98ce2e3ac65db760ae573824f",
"md5": "dc4e08f9fa77ae2c08128815935d9134",
"sha256": "3eaf94439b802d94459115685e7541f1c9eaf52a6f62db709d209e07c2e2089c"
},
"downloads": -1,
"filename": "aprxc-1.1.3-py3-none-any.whl",
"has_sig": false,
"md5_digest": "dc4e08f9fa77ae2c08128815935d9134",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 12284,
"upload_time": "2024-08-26T09:35:53",
"upload_time_iso_8601": "2024-08-26T09:35:53.909192Z",
"url": "https://files.pythonhosted.org/packages/0b/84/dd8ece79ff57749c6f3a02e494c902efdef98ce2e3ac65db760ae573824f/aprxc-1.1.3-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0cbe3ef54e82b82b05f1fa83341a3cadeaeed040a33e26eb893952241e802715",
"md5": "adc17fcb7187d2e718a3337596615a99",
"sha256": "a73664d1fa1827b7fa55822961f5ce2701650faa691996887b79767ebce7f318"
},
"downloads": -1,
"filename": "aprxc-1.1.3.tar.gz",
"has_sig": false,
"md5_digest": "adc17fcb7187d2e718a3337596615a99",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 11370,
"upload_time": "2024-08-26T09:35:55",
"upload_time_iso_8601": "2024-08-26T09:35:55.366513Z",
"url": "https://files.pythonhosted.org/packages/0c/be/3ef54e82b82b05f1fa83341a3cadeaeed040a33e26eb893952241e802715/aprxc-1.1.3.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-26 09:35:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "hellp",
"github_project": "aprxc",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "aprxc"
}
```