# numba-stats

* Version: 1.10.0
* Summary: Numba-accelerated implementations of scipy probability distributions and others used in particle physics
* Upload time: 2024-09-24 16:28:23
* Requires Python: >=3.9
* License: MIT

![](https://img.shields.io/pypi/v/numba-stats.svg)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13236518.svg)](https://doi.org/10.5281/zenodo.13236518)

We provide JIT-compiled (with numba) implementations of common probability distributions.

* Uniform
* (Truncated) Normal
* Log-normal
* Poisson
* Binomial
* (Truncated) Exponential
* Student's t
* Voigtian
* Crystal Ball
* Generalised double-sided Crystal Ball
* Tsallis-Hagedorn, a model for the minimum bias pT distribution
* Q-Gaussian
* Bernstein density (not normalized to unity, use this in extended likelihood fits)
* Cruijff density (not normalized to unity, use this in extended likelihood fits)
* CMS-Shape

The speed gains are huge, up to a factor of 100 compared to `scipy`. Benchmarks are included in the repository and are run by `pytest`.

The distributions are optimized for use in maximum-likelihood fits, where you query a distribution at many points with a single set of parameters.

## Usage

Each distribution is implemented in a submodule. Import the submodule that you need and call the functions in the module.
```py
from numba_stats import norm
import numpy as np

x = np.linspace(-10, 10)
mu = 2.0
sigma = 3.0

p = norm.pdf(x, mu, sigma)
c = norm.cdf(x, mu, sigma)
```
The functions are vectorized on the variate `x`, but not on the shape parameters of the distribution. Ideally, the following functions are implemented for each distribution:
* `pdf`: probability density function
* `logpdf`: the logarithm of the probability density function (can be computed more efficiently and accurately for some distributions)
* `cdf`: integral of the probability density function
* `ppf`: inverse of the cdf
* `rvs`: to generate random variates

`cdf` and `ppf` are missing for some distributions (e.g. `voigt`) if no fast implementation is currently available. `logpdf` is only implemented if it is more efficient and accurate than computing `log(dist.pdf(...))`. `rvs` is only implemented for distributions that have `ppf`, which is used to generate the random variates. The implementations of `rvs` are currently not optimized for highest performance, but turn out to be useful in practice nevertheless.

The distributions in `numba_stats` can be used in other `numba`-JIT'ed functions. The functions in `numba_stats` use a single thread, but the implementations were written so that they profit from auto-parallelization. To enable this, call them from a JIT'ed function with the arguments `parallel=True, fastmath=True`. You should always combine `parallel=True` with `fastmath=True`, since the latter enhances the gain from auto-parallelization.

```py
from numba_stats import norm
import numba as nb
import numpy as np

@nb.njit(parallel=True, fastmath=True)
def norm_pdf(x, mu, sigma):
  return norm.pdf(x, mu, sigma)

# this must be an array of float
x = np.linspace(-10, 10)

# these must be floats
mu = 2.0
sigma = 3.0

# uses all your CPU cores
p = norm_pdf(x, mu, sigma)
```

Note that this is only faster if `x` has sufficient length (about 1000 elements or more). Otherwise, the parallelization overhead will make the call slower; see the benchmarks below.

#### Gotchas and workarounds

##### TypingErrors

When you use the numba-stats distributions in a compiled function, you need to pass the expected data types. The first argument must be a numpy array of floats (32 or 64 bit). The following parameters must be floats. If you pass the wrong arguments, you will get numba errors similar to this one (where parameters were passed as integers instead of floats):
```
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function pdf at 0x7ff7186b7be0>) found for signature:

 >>> pdf(array(float64, 1d, C), int64, int64)
```
You won't get these errors when you call the numba-stats PDFs outside of a compiled function, because I added some wrappers which automatically convert the data types for convenience. This is why you can call `norm.pdf(1, 2, 3)`, but `norm_pdf(1, 2, 3)` (as implemented above) will fail.

##### High-dimensional arrays

To keep the implementation simple, the PDFs all operate on 1D array arguments. If you have a higher-dimensional array, you can reshape it to 1D, pass it to our function, and then reshape the result back. This is a cheap operation.

```py
x = ... # some higher dimensional array
# y = norm_pdf(x, 0.0, 1.0) this fails
y = norm_pdf(x.reshape(-1), 0.0, 1.0).reshape(x.shape)  # OK
```

## Documentation

To get documentation, please use `help()` in the Python interpreter.

Functions with equivalents in `scipy.stats` follow the `scipy` calling conventions exactly, except for distributions starting with `trunc...`, which follow a different convention, since the `scipy` behavior is very impractical. Even so, note that the `scipy` conventions are sometimes a bit unusual, particularly in the case of the exponential, the log-normal, and the uniform distribution. See the `scipy` docs for details.
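
As a reminder of what these conventions look like, the sketch below evaluates the three mentioned distributions with `scipy.stats` itself. The numerical values are arbitrary; the argument order for the corresponding `numba_stats` functions should be confirmed with `help()`.

```py
import numpy as np
from scipy import stats

x = 1.5

# uniform: (loc, scale) describes the interval [loc, loc + scale], not [a, b]
p_uniform = stats.uniform.pdf(x, 1.0, 2.0)  # flat density on [1, 3]

# exponential: scale is the mean 1 / lambda, not the rate lambda
lam = 0.5
p_expon = stats.expon.pdf(x, 0.0, 1.0 / lam)

# log-normal: s is the standard deviation of log(x), scale is exp(mean of log(x))
mu, sigma = 0.3, 0.8
p_lognorm = stats.lognorm.pdf(x, sigma, 0.0, np.exp(mu))
```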

## Citation

If you use this package in a scientific work, please cite us. You can generate citations in your preferred format on the [Zenodo website](https://doi.org/10.5281/zenodo.13236518).

## Benchmarks

The following benchmarks were produced on an Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz against SciPy-1.10.1. The dotted line on the right-hand figure shows the expected speedup (4x) from parallelization on a CPU with four physical cores.

We see large speed-ups with respect to `scipy` for almost all distributions. Calls with short arrays also profit from `numba_stats`, due to the reduced call overhead. The functions `voigt.pdf` and `t.ppf` do not run faster than the `scipy` versions, because we call the respective `scipy` implementation written in FORTRAN. The advantage provided by `numba_stats` here is that you can call these functions from other `numba`-JIT'ed functions, which is not possible with the `scipy` implementations, and `voigt.pdf` still profits from auto-parallelization.

The `bernstein.density` function does not profit from auto-parallelization. On the contrary, it becomes much slower, so calling it from a function compiled with `parallel=True` should be avoided. This is a known issue: the internal implementation cannot be easily auto-parallelized.

![](docs/_static/norm.pdf.svg)
![](docs/_static/norm.cdf.svg)
![](docs/_static/norm.ppf.svg)
![](docs/_static/expon.pdf.svg)
![](docs/_static/expon.cdf.svg)
![](docs/_static/expon.ppf.svg)
![](docs/_static/uniform.pdf.svg)
![](docs/_static/uniform.cdf.svg)
![](docs/_static/uniform.ppf.svg)
![](docs/_static/t.pdf.svg)
![](docs/_static/t.cdf.svg)
![](docs/_static/t.ppf.svg)
![](docs/_static/truncnorm.pdf.svg)
![](docs/_static/truncnorm.cdf.svg)
![](docs/_static/truncnorm.ppf.svg)
![](docs/_static/truncexpon.pdf.svg)
![](docs/_static/truncexpon.cdf.svg)
![](docs/_static/truncexpon.ppf.svg)
![](docs/_static/voigt.pdf.svg)
![](docs/_static/bernstein.density.svg)
![](docs/_static/truncexpon.pdf.plus.norm.pdf.svg)

## Contributions

**You can help by adding more distributions; patches are welcome.** Implementing a probability distribution is easy. You need to write it in simple Python that `numba` can understand. Special functions from `scipy.special` can be used after some wrapping; see the submodule `numba_stats._special.py` for how it is done.

## numba-stats and numba-scipy

[numba-scipy](https://github.com/numba/numba-scipy) is the official package and repository for fast numba-accelerated scipy functions, so are we reinventing the wheel?

Ideally, the functionality in this package should be in `numba-scipy` and we hope that eventually this will be the case. In this package, we don't offer overloads for scipy functions and classes like `numba-scipy` does. This simplifies the implementation dramatically. `numba-stats` is intended as a temporary solution until fast statistical functions are included in `numba-scipy`. `numba-stats` currently does not depend on `numba-scipy`, only on `numba` and `scipy`.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "numba-stats",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": null,
    "author": null,
    "author_email": "Hans Dembinski <hans.dembinski@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a5/6a/9caf75aa81df5feb975c8668e7d4e6d1da0bae63ba3170e8784dd1b2e9a4/numba_stats-1.10.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Numba-accelerated implementations of scipy probability distributions and others used in particle physics",
    "version": "1.10.0",
    "project_urls": {
        "repository": "https://github.com/hdembinski/numba-stats"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "586d2bcb6f2b3f527b39cbeb52affa3b9e0f32a6f3bf9cec9ae5ccfaa4ed1cbf",
                "md5": "84a6d5e62ee4253d27bf70e8bfef02a0",
                "sha256": "9ca2fbf47dc235563e6ad83c907292a5e3baaf64eb5c76afddfd71cb050bf42e"
            },
            "downloads": -1,
            "filename": "numba_stats-1.10.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "84a6d5e62ee4253d27bf70e8bfef02a0",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 25570,
            "upload_time": "2024-09-24T16:28:22",
            "upload_time_iso_8601": "2024-09-24T16:28:22.333387Z",
            "url": "https://files.pythonhosted.org/packages/58/6d/2bcb6f2b3f527b39cbeb52affa3b9e0f32a6f3bf9cec9ae5ccfaa4ed1cbf/numba_stats-1.10.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "a56a9caf75aa81df5feb975c8668e7d4e6d1da0bae63ba3170e8784dd1b2e9a4",
                "md5": "62d6841453bece3a4a060ab0171ad387",
                "sha256": "d65b0824b4f5a89cdcfd5f7d538ab636da01b2de715979f4b7156d6d72ac9fe9"
            },
            "downloads": -1,
            "filename": "numba_stats-1.10.0.tar.gz",
            "has_sig": false,
            "md5_digest": "62d6841453bece3a4a060ab0171ad387",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 215173,
            "upload_time": "2024-09-24T16:28:23",
            "upload_time_iso_8601": "2024-09-24T16:28:23.777197Z",
            "url": "https://files.pythonhosted.org/packages/a5/6a/9caf75aa81df5feb975c8668e7d4e6d1da0bae63ba3170e8784dd1b2e9a4/numba_stats-1.10.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-09-24 16:28:23",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hdembinski",
    "github_project": "numba-stats",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "numba-stats"
}
        