[![PyPI Version](https://img.shields.io/pypi/v/batchstats.svg)](https://pypi.org/project/batchstats/) [![conda Version](https://anaconda.org/conda-forge/batchstats/badges/version.svg)](https://anaconda.org/conda-forge/batchstats)
# <img src="https://raw.githubusercontent.com/CyrilJl/BatchStats/main/docs/source/_static/logo_batchstats.svg" alt="Logo BatchStats" width="56.5" height="35"> BatchStats: Efficient Batch Statistics Computation in Python
`batchstats` is a Python package designed to compute various statistics of data that arrive batch by batch, making it suitable for streaming input or data too large to fit in memory.
## Installation
You can install `batchstats` using pip:
``` console
pip install batchstats
```
The package is also available on conda-forge:
``` console
conda install -c conda-forge batchstats
```
or with `mamba`:
``` console
mamba install batchstats
```
## Usage
Here's an example of how to use `batchstats` to compute batch mean and variance:
```python
from batchstats import BatchMean, BatchVar

# Initialize BatchMean and BatchVar objects
batchmean = BatchMean()
batchvar = BatchVar()

# Iterate over your generator of data batches
for batch in your_data_generator:
    # Update BatchMean and BatchVar with the current batch of data
    batchmean.update_batch(batch)
    batchvar.update_batch(batch)

# Compute and print the mean and variance
print("Batch Mean:", batchmean())
print("Batch Variance:", batchvar())
```
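For a fully self-contained run, any iterable of array batches works as the source. Here is a minimal sketch where the toy generator and the `numpy` cross-checks are illustrative additions, not part of the API:

```python
import numpy as np
from batchstats import BatchMean, BatchVar

# Toy data and a generator standing in for a real streaming source
rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 20))

def toy_batches(arr, n_batches=10):
    yield from np.array_split(arr, n_batches)

batchmean, batchvar = BatchMean(), BatchVar()
for batch in toy_batches(data):
    batchmean.update_batch(batch)
    batchvar.update_batch(batch)

# The streamed results match the one-shot numpy computations
print(np.allclose(batchmean(), data.mean(axis=0)))  # True
print(np.allclose(batchvar(), data.var(axis=0)))    # True
```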
It is also possible to compute the covariance between two datasets:
```python
import numpy as np
from batchstats import BatchCov

n_samples, m, n = 10_000, 100, 50
data1 = np.random.randn(n_samples, m)
data2 = np.random.randn(n_samples, n)
n_batches = 7

batchcov = BatchCov()
for batch_index in np.array_split(np.arange(n_samples), n_batches):
    batchcov.update_batch(batch1=data1[batch_index], batch2=data2[batch_index])

true_cov = (data1 - data1.mean(axis=0)).T @ (data2 - data2.mean(axis=0)) / n_samples
np.allclose(true_cov, batchcov()), batchcov().shape
>>> (True, (100, 50))
```
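The square covariance matrix of a single dataset follows as the special case where both batches are the same; a minimal sketch reusing the `update_batch(batch1=..., batch2=...)` signature shown above:

```python
import numpy as np
from batchstats import BatchCov

data = np.random.randn(10_000, 20)

batchcov = BatchCov()
for batch_index in np.array_split(np.arange(10_000), 5):
    x = data[batch_index]
    batchcov.update_batch(batch1=x, batch2=x)

# Same 1/n (biased) normalization as the true_cov check above
centered = data - data.mean(axis=0)
print(np.allclose(batchcov(), centered.T @ centered / len(data)))  # expected: True
print(batchcov().shape)  # (20, 20)
```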
`batchstats` is also flexible in terms of input shapes. By default, statistics are computed along the first axis: the first dimension represents the samples and the remaining dimensions represent the features:
```python
import numpy as np
from batchstats import BatchSum

data = np.random.randn(10_000, 80, 90)
n_batches = 7

batchsum = BatchSum()
for batch_data in np.array_split(data, n_batches):
    batchsum.update_batch(batch_data)

true_sum = np.sum(data, axis=0)
np.allclose(true_sum, batchsum()), batchsum().shape
>>> (True, (80, 90))
```
However, as with the corresponding `numpy` functions, users can specify the reduction axis or axes:
```python
import numpy as np
from batchstats import BatchMean

data = [np.random.randn(24, 7, 128) for _ in range(100)]

batchmean = BatchMean(axis=(0, 2))
for batch in data:
    batchmean.update_batch(batch)
batchmean().shape
>>> (7,)

batchmean = BatchMean(axis=2)
for batch in data:
    batchmean.update_batch(batch)
batchmean().shape
>>> (24, 7)
```
## Available Classes/Stats
- `BatchCov`: Compute the covariance matrix of two datasets (not necessarily square)
- `BatchMax`: Compute the maximum value (counterpart of `np.max`)
- `BatchMean`: Compute the mean (counterpart of `np.mean`)
- `BatchMin`: Compute the minimum value (counterpart of `np.min`)
- `BatchPeakToPeak`: Compute the peak-to-peak range, i.e. maximum minus minimum (counterpart of `np.ptp`)
- `BatchStd`: Compute the standard deviation (counterpart of `np.std`)
- `BatchSum`: Compute the sum (counterpart of `np.sum`)
- `BatchVar`: Compute the variance (counterpart of `np.var`)
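Several of these statistics can be accumulated over the same stream in a single pass; a minimal sketch using the classes listed above:

```python
import numpy as np
from batchstats import BatchMax, BatchMin, BatchPeakToPeak

data = np.random.randn(10_000, 30)

# One pass over the batches feeds all three accumulators
stats = {"min": BatchMin(), "max": BatchMax(), "ptp": BatchPeakToPeak()}
for batch in np.array_split(data, 13):
    for stat in stats.values():
        stat.update_batch(batch)

# Cross-check against the numpy counterparts
assert np.allclose(stats["min"](), np.min(data, axis=0))
assert np.allclose(stats["max"](), np.max(data, axis=0))
assert np.allclose(stats["ptp"](), np.ptp(data, axis=0))
```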
Each class is tested against numpy results to ensure accuracy. For example:
```python
import numpy as np
from batchstats import BatchMean

def test_mean(data, n_batches):
    true_stat = np.mean(data, axis=0)

    batchmean = BatchMean()
    for batch_data in np.array_split(data, n_batches):
        batchmean.update_batch(batch=batch_data)
    batch_stat = batchmean()
    return np.allclose(true_stat, batch_stat)

data = np.random.randn(1_000_000, 50)
n_batches = 31
test_mean(data, n_batches)
>>> True
```
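The same check generalizes to the other classes; a sketch of how it could be parametrized with `pytest` (the `pytest` wrapper is an illustrative addition, not part of the package's own test suite):

```python
import numpy as np
import pytest
from batchstats import BatchMax, BatchMean, BatchSum

@pytest.mark.parametrize("batch_cls, np_func", [
    (BatchMean, np.mean),
    (BatchSum, np.sum),
    (BatchMax, np.max),
])
def test_against_numpy(batch_cls, np_func):
    data = np.random.randn(10_000, 50)
    stat = batch_cls()
    for batch_data in np.array_split(data, 31):
        stat.update_batch(batch=batch_data)
    assert np.allclose(np_func(data, axis=0), stat())
```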
## Performance
In addition to result accuracy, much attention has been given to computation time and memory usage. Fun fact: computing the variance with `batchstats` consumes very little extra RAM while being faster than `numpy.var`:
```python
%load_ext memory_profiler
import numpy as np
from batchstats import BatchVar

data = np.random.randn(100_000, 1000)
print(data.nbytes / 2**20)

%memit a = np.var(data, axis=0)
%memit b = BatchVar().update_batch(data)()
np.allclose(a, b)
>>> 762.939453125
>>> peak memory: 1604.63 MiB, increment: 763.35 MiB
>>> peak memory: 842.62 MiB, increment: 0.91 MiB
>>> True

%timeit a = np.var(data, axis=0)
%timeit b = BatchVar().update_batch(data)()
>>> 510 ms ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> 306 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
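The memory advantage matters most when the full dataset never fits in RAM; a minimal sketch where batches are produced lazily (the generator below is a stand-in for a real out-of-core source such as file reads or database queries):

```python
import numpy as np
from batchstats import BatchVar

def lazy_batches(n_batches=100, batch_size=1_000, n_features=1_000):
    # Stand-in for chunks read from disk, a database, a socket, ...
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        yield rng.standard_normal((batch_size, n_features))

batchvar = BatchVar()
for batch in lazy_batches():
    batchvar.update_batch(batch)  # only one batch is in memory at a time

print(batchvar().shape)  # (1000,)
```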
## NaN Handling
While the `Batch*` classes above exclude every sample containing at least one NaN from the computation, the `BatchNan*` classes handle NaN values more flexibly, in the spirit of `np.nansum`, `np.nanmean`, etc. Consequently, the resulting statistics may be computed from a different number of samples for each feature:
```python
import numpy as np
from batchstats import BatchNanSum

m, n = 1_000_000, 50
nan_ratio = 0.05
n_batches = 17

data = np.random.randn(m, n)
num_nans = int(m * n * nan_ratio)
nan_indices = np.random.choice(range(m * n), num_nans, replace=False)
data.ravel()[nan_indices] = np.nan

batchsum = BatchNanSum()
for batch_data in np.array_split(data, n_batches):
    batchsum.update_batch(batch=batch_data)
np.allclose(np.nansum(data, axis=0), batchsum())
>>> True
```
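To make the "different number of samples per feature" point concrete, the per-feature counts of valid values can be inspected directly (the count bookkeeping below is plain `numpy`, not part of `batchstats`):

```python
import numpy as np
from batchstats import BatchNanSum

data = np.random.randn(1_000, 5)
data[np.random.rand(1_000, 5) < 0.2] = np.nan  # roughly 20% missing per feature

batchsum = BatchNanSum()
for batch_data in np.array_split(data, 4):
    batchsum.update_batch(batch=batch_data)

valid_counts = np.sum(~np.isnan(data), axis=0)
print(valid_counts)  # varies from feature to feature
print(np.allclose(batchsum(), np.nansum(data, axis=0)))  # True
```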
## Documentation
The documentation is available [here](https://batchstats.readthedocs.io/en/latest/).
## Requesting Additional Statistics
If you require additional statistics that are not currently implemented in `batchstats`, feel free to open an issue on the GitHub repository or submit a pull request with your suggested feature. We welcome contributions and feedback from the community to improve `batchstats` and make it more versatile for various data analysis tasks.