| Field | Value |
|-------|-------|
| Name | batchstats |
| Version | 0.4.4 |
| Summary | Efficient batch statistics computation library for Python. |
| author | Cyril Joly |
| home_page | None |
| maintainer | None |
| docs_url | None |
| requires_python | >=3.6 |
| license | None |
| keywords | None |
| upload_time | 2025-01-02 16:39:35 |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
[PyPI](https://pypi.org/project/batchstats/) · [conda-forge](https://anaconda.org/conda-forge/batchstats) · [Documentation](https://batchstats.readthedocs.io/en/latest/?badge=latest)
[Tests](https://github.com/CyrilJl/BatchStats/actions/workflows/pytest.yml) · [Codacy](https://app.codacy.com/gh/CyrilJl/BatchStats/dashboard?utm_source=gh&utm_medium=referral&utm_content=&utm_campaign=Badge_grade)
# <img src="https://raw.githubusercontent.com/CyrilJl/BatchStats/main/docs/source/_static/logo_batchstats.svg" alt="Logo BatchStats" width="200" height="150" align="right"> BatchStats
`batchstats` is a Python package designed to compute various statistics of data that arrive batch by batch (in chunks or segments), making it suitable for streaming input or data too large to fit in memory. The classes and methods implemented in `batchstats` are based on online algorithms—algorithms that process input piece by piece in a serial fashion, without requiring the entire input to be available from the start. For covariance and variance calculations, the package employs the celebrated Welford's online algorithm. Special care has been given to ensuring numerical precision, optimizing computation time, and minimizing memory usage.
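For intuition, here is a minimal scalar sketch of Welford's update rule; the `WelfordScalar` class below is purely illustrative and not part of the `batchstats` API:

```python
import numpy as np

class WelfordScalar:
    """Illustrative scalar version of Welford's online mean/variance update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n  # population variance, as in np.var

data = np.random.randn(1_000)
w = WelfordScalar()
for x in data:
    w.update(x)
np.allclose(w.variance(), np.var(data))
>>> True
```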
## Installation
You can install `batchstats` using pip:
``` console
pip install batchstats
```
The package is also available on conda-forge:
``` console
conda install -c conda-forge batchstats
```
or with mamba:
``` console
mamba install batchstats
```
## Usage
Here's an example of how to use `batchstats` to compute batch mean and variance:
```python
from batchstats import BatchMean, BatchVar
# Initialize BatchMean and BatchVar objects
batchmean = BatchMean()
batchvar = BatchVar()
# Iterate over your generator of data batches
for batch in your_data_generator:
    # Update BatchMean and BatchVar with the current batch of data
    batchmean.update_batch(batch)
    batchvar.update_batch(batch)
# Compute and print the mean and variance
print("Batch Mean:", batchmean())
print("Batch Variance:", batchvar())
```
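Any iterable of array batches works as the data source. As a concrete sketch, here is a hypothetical `chunk_generator` helper (not part of the package) that streams a large array in pieces:

```python
import numpy as np
from batchstats import BatchMean

def chunk_generator(data, n_chunks):
    # Hypothetical helper: yield successive chunks of a larger array
    for chunk in np.array_split(data, n_chunks):
        yield chunk

data = np.random.randn(10_000, 20)
batchmean = BatchMean()
for batch in chunk_generator(data, n_chunks=10):
    batchmean.update_batch(batch)
np.allclose(batchmean(), data.mean(axis=0))
>>> True
```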
It is also possible to compute the covariance between two datasets:
```python
import numpy as np
from batchstats import BatchCov
n_samples, m, n = 10_000, 100, 50
data1 = np.random.randn(n_samples, m)
data2 = np.random.randn(n_samples, n)
n_batches = 7
batchcov = BatchCov()
for batch_index in np.array_split(np.arange(n_samples), n_batches):
    batchcov.update_batch(batch1=data1[batch_index], batch2=data2[batch_index])
true_cov = (data1 - data1.mean(axis=0)).T @ (data2 - data2.mean(axis=0)) / n_samples
np.allclose(true_cov, batchcov()), batchcov().shape
>>> (True, (100, 50))
```
`batchstats` is also flexible in terms of input shapes. By default, statistics are computed along the first axis, the first dimension representing the samples and the remaining dimensions the features:
```python
import numpy as np
from batchstats import BatchSum
data = np.random.randn(10_000, 80, 90)
n_batches = 7
batchsum = BatchSum()
for batch_data in np.array_split(data, n_batches):
    batchsum.update_batch(batch_data)
true_sum = np.sum(data, axis=0)
np.allclose(true_sum, batchsum()), batchsum().shape
>>> (True, (80, 90))
```
However, similar to the associated functions in `numpy`, users can specify the reduction axis or axes:
```python
import numpy as np
from batchstats import BatchMean
data = [np.random.randn(24, 7, 128) for _ in range(100)]
batchmean = BatchMean(axis=(0, 2))
for batch in data:
    batchmean.update_batch(batch)
batchmean().shape
>>> (7,)
batchmean = BatchMean(axis=2)
for batch in data:
    batchmean.update_batch(batch)
batchmean().shape
>>> (24, 7)
```
## Merging Two Objects
In some cases, it is useful to process two different `BatchStats` objects from asynchronous I/O functions and then merge their statistics at the end. The `batchstats` library supports this by allowing the simple addition of two objects. Under the hood, the necessary computations are performed to produce a resulting statistic that reflects the data from both input datasets, even when the two datasets differ in size:
```python
import numpy as np
from batchstats import BatchCov
data = np.random.randn(25_000, 50)
data1 = data[:10_000]
data2 = data[10_000:]
cov = BatchCov().update_batch(data)
cov1 = BatchCov().update_batch(data1)
cov2 = BatchCov().update_batch(data2)
cov_merged = cov1 + cov2
np.allclose(cov(), cov_merged())
>>> True
```
The `__add__` method has been specifically overloaded to facilitate the merging of statistical objects in `batchstats`, including `BatchCov`, `BatchMax`, `BatchMean`, `BatchMin`, `BatchPeakToPeak`, `BatchStd`, `BatchSum`, and `BatchVar`.
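Because `__add__` is defined, partial results from many workers can be folded together, for example with `functools.reduce` (a sketch; the three-way split below merely simulates independent workers):

```python
from functools import reduce
import operator

import numpy as np
from batchstats import BatchMean

data = np.random.randn(30_000, 10)

# Simulate three workers that each accumulate statistics independently
partials = [BatchMean().update_batch(part) for part in np.array_split(data, 3)]

merged = reduce(operator.add, partials)
np.allclose(merged(), data.mean(axis=0))
>>> True
```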
## Performance
In addition to result accuracy, much attention has been given to computation time and memory usage. Fun fact: computing the variance with `batchstats` consumes little RAM while being faster than `numpy.var`:
```python
%load_ext memory_profiler
import numpy as np
from batchstats import BatchVar
data = np.random.randn(100_000, 1000)
print(data.nbytes/2**20)
>>> 762.939453125
%memit a = np.var(data, axis=0)
>>> peak memory: 1604.63 MiB, increment: 763.35 MiB
%memit b = BatchVar().update_batch(data)()
>>> peak memory: 842.62 MiB, increment: 0.91 MiB
np.allclose(a, b)
>>> True
%timeit a = np.var(data, axis=0)
>>> 510 ms ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit b = BatchVar().update_batch(data)()
>>> 306 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
## NaN Handling
While the `Batch*` classes above exclude every sample containing at least one NaN from the computations, the `BatchNan*` classes adopt a more flexible approach to handling NaN values, similar to `np.nansum`, `np.nanmean`, etc. Consequently, the output statistics can be computed from a different number of samples for each feature:
```python
import numpy as np
from batchstats import BatchNanSum
m, n = 1_000_000, 50
nan_ratio = 0.05
n_batches = 17
data = np.random.randn(m, n)
num_nans = int(m * n * nan_ratio)
nan_indices = np.random.choice(range(m * n), num_nans, replace=False)
data.ravel()[nan_indices] = np.nan
batchsum = BatchNanSum()
for batch_data in np.array_split(data, n_batches):
    batchsum.update_batch(batch=batch_data)
np.allclose(np.nansum(data, axis=0), batchsum())
>>> True
```
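The same pattern should carry over to the other `BatchNan*` classes; for instance, a sketch assuming a `BatchNanMean` class mirroring `np.nanmean` (the class name follows the naming pattern above but is an assumption here):

```python
import numpy as np
from batchstats import BatchNanMean  # assumed counterpart of np.nanmean

data = np.random.randn(10_000, 5)
data[np.random.rand(*data.shape) < 0.05] = np.nan  # sprinkle in ~5% NaNs

batchmean = BatchNanMean()
for batch_data in np.array_split(data, 4):
    batchmean.update_batch(batch=batch_data)
np.allclose(np.nanmean(data, axis=0), batchmean())
>>> True
```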
## Available Classes/Stats
- `BatchCov`: Compute the covariance matrix of two datasets (not necessarily square)
- `BatchMax`: Compute the maximum value (analogous to `np.max`)
- `BatchMean`: Compute the mean (analogous to `np.mean`)
- `BatchMin`: Compute the minimum value (analogous to `np.min`)
- `BatchPeakToPeak`: Compute the peak-to-peak value, i.e. maximum minus minimum (analogous to `np.ptp`)
- `BatchStd`: Compute the standard deviation (analogous to `np.std`)
- `BatchSum`: Compute the sum (analogous to `np.sum`)
- `BatchVar`: Compute the variance (analogous to `np.var`)
Each class is tested against numpy results to ensure accuracy. For example:
```python
import numpy as np
from batchstats import BatchMean
def test_mean(data, n_batches):
    true_stat = np.mean(data, axis=0)
    batchmean = BatchMean()
    for batch_data in np.array_split(data, n_batches):
        batchmean.update_batch(batch=batch_data)
    batch_stat = batchmean()
    return np.allclose(true_stat, batch_stat)
data = np.random.randn(1_000_000, 50)
n_batches = 31
test_mean(data, n_batches)
>>> True
```
## Machine Learning Application
Fitting a simple linear regression on chunked or streaming data can be done using `BatchCov`, for example:
```python
import numpy as np
from sklearn.base import RegressorMixin, BaseEstimator
from batchstats import BatchCov
class IncrementalLinearRegression(RegressorMixin, BaseEstimator):
    """
    IncrementalLinearRegression performs linear regression in an incremental way
    using batches of data. It uses BatchCov to accumulate covariance and mean
    information for incremental updates.
    """

    def __init__(self):
        self.cov_ = BatchCov()

    def partial_fit(self, X, y):
        self.cov_.update_batch(np.c_[X, y])
        return self

    def _compute_parameters(self):
        means = self.cov_.mean1()
        cov_matrix = self.cov_()
        # Calculate the coefficients
        coef_ = np.linalg.inv(cov_matrix[:-1, :-1]) @ cov_matrix[-1][:-1]
        # Calculate the intercept
        intercept_ = means[-1] - coef_ @ means[:-1]
        return coef_, intercept_

    def fit(self, X, y):
        return self.partial_fit(X, y)

    @property
    def coef_(self):
        coef_, _ = self._compute_parameters()
        return coef_

    @property
    def intercept_(self):
        _, intercept_ = self._compute_parameters()
        return intercept_

    def predict(self, X):
        return X @ self.coef_ + self.intercept_
# Generate a synthetic regression dataset
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100_000, n_features=50, n_informative=35, bias=8)
X[:, 8] += 5 # Adding a shift to feature 8 for testing purposes
model = IncrementalLinearRegression()
# Simulate updating the model in batches (e.g., 17 batches)
n_batches = 17
for index in np.array_split(np.arange(len(X)), n_batches):
    model.partial_fit(X[index], y[index])
# Compare with sklearn's LinearRegression model (using full data)
from sklearn.linear_model import LinearRegression
linear = LinearRegression().fit(X, y)
# Check if the results match (coefficients and intercept)
np.allclose(linear.coef_, model.coef_), np.allclose(linear.intercept_, model.intercept_)
>>> (True, True)
```
## Documentation
The documentation is available [here](https://batchstats.readthedocs.io).
## Requesting Additional Statistics
If you require additional statistics that are not currently implemented in `batchstats`, feel free to open an issue on the GitHub repository or submit a pull request with your suggested feature. We welcome contributions and feedback from the community to improve `batchstats` and make it more versatile for various data analysis tasks.