aprxc

Name: aprxc
Version: 1.1.3
Summary: A command-line tool to estimate the number of distinct lines in a file/stream using Chakraborty/Vinodchandran/Meel's approximation algorithm.
Upload time: 2024-08-26 09:35:55
Requires Python: >=3.11
Keywords: algorithm, cli, computer-science, math
Requirements: none recorded

# `aprxc` (ApproxiCount)

A command-line tool (and Python class) to approximate the number of distinct
elements in files (or a stream) using the *F0-Estimator* algorithm by S.
Chakraborty, N. V. Vinodchandran and K. S. Meel, as described in their 2023
paper "Distinct Elements in Streams: An Algorithm for the (Text) Book"
(https://arxiv.org/pdf/2301.10191#section.2).
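
The core algorithm is short enough to show in full. Below is a minimal Python
sketch of the F0-Estimator as described in the paper, for illustration only; it
is not the actual `aprxc` implementation, and the function and parameter names
are made up for this example.

```python
import random
from math import ceil, log2
from typing import Iterable


def estimate_distinct(stream: Iterable[str], epsilon: float = 0.1,
                      delta: float = 0.1, size: int = 2**63) -> float:
    """CVM F0-Estimator: estimate the number of distinct items in `stream`.

    `size` is an upper bound on the stream length; `epsilon` and `delta`
    control accuracy and failure probability, as in the paper.
    """
    threshold = ceil(12 / epsilon**2 * log2(8 * size / delta))
    p = 1.0                     # current sampling probability
    sample: set[str] = set()    # holds at most `threshold` items

    for item in stream:
        sample.discard(item)               # forget any earlier decision
        if random.random() < p:            # re-admit with probability p
            sample.add(item)
        if len(sample) == threshold:       # buffer full: halve the sample
            sample = {x for x in sample if random.random() < 0.5}
            p /= 2
            if len(sample) == threshold:   # extremely unlikely
                raise RuntimeError("estimation failed")

    return len(sample) / p                 # scale the sample back up
```

The sample never holds more than `threshold` items, which is where the fixed
memory footprint comes from.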

**Motivation (elevator pitch):** Easier to remember and always faster than `sort
| uniq -c | wc -l`. Uses a fixed amount of memory for huge datasets, unlike the
ever-growing footprint of `awk '!a[$0]++' | wc -l`. Counts accurately for the
first ~83k unique elements (on 64-bit systems), with a deviation of about 0.4โ€“1%
after that.

## Installation

Choose your preferred way:

```shell
pip install aprxc
uv tool install aprxc
```

Or test-run it in an isolated environment first, via [pipx run](https://pipx.pypa.io/) or [uvx](https://docs.astral.sh/uv/concepts/tools/):

```shell
pipx run aprxc --help
uvx aprxc --help
```

Lastly, as `aprxc.py` has no dependencies besides Python 3.11+, you can simply
download it, run it, put it in your PATH, vendor it, etc.

## Features and shortcomings

Compared to sort/uniq:

- sort/uniq always uses less memory (about 30-50%).
- sort/uniq is about 5 times *slower*.

Compared to 'the awk construct':

- awk uses about the same amount of time (0.5x-2x).
- awk uses *much more* memory for large files: essentially linear in the file
    size, while ApproxiCount has an upper bound. For typical multi-GiB files
    this can mean factors of 20x-150x, e.g. 5GiB (awk) vs. 40MiB (aprxc).

Now let's address the elephant in the room: all these advantages (yes, the pros
and cons are pretty balanced, but overall `aprxc` performs better than the
alternatives, especially with large data inputs) are bought with **an
inaccuracy in the reported counts**.

### About inaccuracy

But how inaccurate? In its default configuration you'll get a **mean inaccuracy
of about 0.4%**, with occasional **outliers around 1%**. For example, if the
script encounters 10M (`10_000_000`) actual unique values, the reported count is
typically ~40k off (e.g. `10_038_680`), sometimes ~100k (e.g. `9_897_071`).
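
For reference, the relative error of those two example counts works out like
this:

```python
>>> abs(10_038_680 - 10_000_000) / 10_000_000   # typical case: ~0.39% off
0.003868
>>> abs(9_897_071 - 10_000_000) / 10_000_000    # outlier: ~1.03% off
0.0102929
```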

Here's an overview (highly unscientific!) of how the algorithm parameters ε and
δ (`--epsilon` and `--delta` on the command line) affect the inaccuracy. The
defaults of `0.1` for both values seem to strike a good balance (and a memorable
inaccuracy of ~1%). Epsilon is the main knob, and you can see quite clearly how
its value affects the maximum inaccuracy in particular.

(For this overview I counted 10 million unique 32-character strings[^1], and _for
each_ iteration I compared the reported count to the actual number of unique
items. 'Mean inacc.' is the mean inaccuracy across all 10M steps; 'max inacc.'
is the largest deviation encountered; memory usage is the 'maxresident' reported
by the Linux `time` tool; time usage is wall time.)

|   ε  |  δ  | set size | mean inacc. | max inacc.  |   memory usage  |  time usage  |
| ---- | --- | --------:| ----------- | ----------- | ---------------:| ------------:|
| 0.01 | 0.1 |  8318632 |     0.004%  |     0.034%  | 1155MiB (4418%) | 12.5s (162%) |
| 0.05 | 0.1 |   332746 |     0.17%   |     0.43%   |   70MiB  (269%) |  9.5s (123%) |
| 0.1  | 0.1 |    83187 |   __0.37%__ |   __0.97%__ |   26MiB  (100%) |  7.7s (100%) |
| 0.2  | 0.1 |    20797 |     0.68%   |     2.16%   |   17MiB   (65%) |  7.3s  (95%) |
| 0.5  | 0.5 |     3216 |     1.75%   |     5.45%   |   13MiB   (36%) |  8.8s (114%) |

**Important (and nice feature):** In its default configuration, the algorithm
uses a set data structure with 83187 slots, meaning that until that number of
unique elements is encountered the reported counts are **exact**; only once this
limit is reached does the 'actual' approximation algorithm kick in, and the
numbers become estimates.
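
The 'set size' column above appears to follow the threshold formula from the
paper, ⌈12/ε² · log₂(8·m/δ)⌉, with the stream-length bound m seemingly
defaulting to `sys.maxsize` when `--size` is not given. This is an observation
reverse-engineered from the table, not a documented property of `aprxc`, but
the following snippet reproduces the numbers:

```python
import sys
from math import ceil, log2


def buffer_size(epsilon: float, delta: float, size: int = sys.maxsize) -> int:
    # Threshold from the CVM paper: ceil(12/eps^2 * log2(8*m/delta)).
    return ceil(12 / epsilon**2 * log2(8 * size / delta))


assert buffer_size(0.1, 0.1) == 83_187      # the default configuration
assert buffer_size(0.01, 0.1) == 8_318_632
assert buffer_size(0.5, 0.5) == 3_216
```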

### Is it useful?

- You have to be okay with the inaccuracies, obviously.
- However, for small unique counts (fewer than ~80k) the numbers are exact and
  the command might be easier to remember than the sort/uniq pipe or the awkward
  awk construct.
- It's basically always faster than the sort/uniq pipe.
- If you are memory-constrained and want to deal with large files, it might be
  an option.
- If you are doing exploratory work and don't care about exact numbers, or will
  round them anyway in the end, this can save you time.

### The experimental 'top most common' feature

I've added a couple of lines of code to support a 'top most common' items
feature, as an alternative to the `sort | uniq -c | sort -rn | head` pipeline or
[Tim Bray's nice `topfew` tool (written in
Go)](https://github.com/timbray/topfew/). (A rough sketch of the general idea
follows at the end of this section.)

It kinda works, but…

- The counts are good, surprisingly good even, just as for the base algorithm,
  but they are definitely worse and not as reliable as the nice sub-1%
  inaccuracy of the total-count case.
- I lack the mathematical expertise to prove or disprove anything about that
  feature.
- If you ask for a top 10 (`-t10` or `--top 10`), you mostly get what you
  expect, but if the counts are close the lower ranks become 'unstable'; even
  ranks 1 and 2 sometimes switch places.
- Compared with `topfew` (I wondered if this approximation algorithm could be
  _an optional_ flag for `topfew`), this Python code is impressively close to
  the Go performance, especially when reading a lot of data from a pipe.
  Unfortunately, I fear that this algorithm is not parallelizable. But I leave
  that, and the re-implementation in Go or Rust, as an exercise for the reader
  :)
- Just try it!
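
For the curious, here is one way such a 'top most common' feature *could* be
built on top of the bounded sample: keep a counter only for items that are
currently in the sample and report the largest ones at the end. This is just a
hedged sketch of the general idea, with made-up names such as `approx_top`; it
is not `aprxc`'s actual implementation, which may well differ.

```python
import random
from collections import Counter
from typing import Iterable


def approx_top(stream: Iterable[str], x: int = 10,
               threshold: int = 83_187) -> list[tuple[str, int]]:
    """Rough top-X sketch on top of a CVM-style bounded sample.

    Counts are kept only for items currently in the sample, so memory stays
    bounded. Items that enter the sample late miss their earlier occurrences,
    which helps explain why lower ranks can be unstable.
    """
    p = 1.0
    counts: Counter[str] = Counter()
    for item in stream:
        if item in counts:                 # already sampled: just count it
            counts[item] += 1
        elif random.random() < p:          # admit new items with probability p
            counts[item] = 1
        if len(counts) == threshold:       # buffer full: halve the sample
            counts = Counter({k: v for k, v in counts.items()
                              if random.random() < 0.5})
            p /= 2
    return counts.most_common(x)
```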

## Command-line interface

```shell
usage: aprxc [-h] [--top [X]] [--size SIZE] [--epsilon EPSILON]
             [--delta DELTA] [--cheat | --no-cheat] [--verbose] [--version]
             [--debug]
             [path ...]

Estimate the number of distinct lines in a file or stream.

positional arguments:
  path                  Input file path(s) and/or '-' for stdin (default:
                        stdin)

options:
  -h, --help            show this help message and exit
  --top [X], -t [X]     EXPERIMENTAL: Show X most common values. Off by
                        default. If enabled, X defaults to 10.
  --size SIZE, -s SIZE  Total amount of data items, if known in advance. (Can
                        be approximated.)
  --epsilon EPSILON, -E EPSILON
  --delta DELTA, -D DELTA
  --cheat, --no-cheat   Use 'total seen' number as upper bound for unique
                        count.
  --verbose, -v
  --version, -V         show program's version number and exit
  --debug
```

---

[^1]:
    The benchmark script:

    ```shell
    cat /dev/urandom | pv -q -L 1000M | base64 -w 32 | command time ./aprxc.py --debug --epsilon=0.1 --delta=0.1
    ```

            
