Name | vose JSON |
Version |
0.0.2
JSON |
| download |
home_page | https://github.com/MaxHalford/vose |
Summary | Cython implementation of Vose's Alias method. |
upload_time | 2024-08-23 08:50:51 |
maintainer | None |
docs_url | None |
author | Max Halford |
requires_python | >=3.10 |
license | MIT |
keywords |
|
VCS |
|
bugtrack_url |
|
requirements |
No requirements were recorded.
|
Travis-CI |
No Travis.
|
coveralls test coverage |
No coveralls.
|
# vose
This is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!
I wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).
## Installation
```sh
pip install vose
```
## Usage
You first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.
```py
>>> import numpy as np
>>> import vose
>>> weights = np.array([.1, .3, .4, .2])
>>> sampler = vose.Sampler(weights, seed=42)
```
You can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.
```py
>>> sampler.sample()
2
```
You can set the `k` parameter in order to produce multiple samples.
```py
>>> sampler.sample(k=10)
array([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])
```
By default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.
```py
>>> sampler = vose.Sampler(weights, seed=42, copy=False)
```
Note that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.
```py
>>> weights = [.2, .3, .5]
>>> sampler = vose.Sampler(np.array(weights))
```
You can also use `vose.Sampler` from within your own Cython `.pyx` file:
```py
import numpy as np
cimport vose
cimport numpy as np
cdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)
cdef vose.Sampler sampler
sampler = vose.Sampler(weights)
cdef int sample = sampler.sample_1()
cdef np.int [:] samples = sampler.sample_k(10)
```
Note that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:
```py
from setuptools import Extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np
extension = Extension(
'*', ['your_file.pyx'],
include_dirs=[np.get_include()],
define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]
)
setup(ext_modules=cythonize([extension]))
```
## Is it reliable?
It seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):
```py
>>> import numpy as np
>>> from scipy import stats
>>> rng = np.random.RandomState(42)
>>> k = 1000
>>> for n in range(3, 20):
... weights = rng.dirichlet(np.arange(1, n))
... sampler = vose.Sampler(weights, seed=42)
... samples = sampler.sample(k)
... chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)
... assert chi2.pvalue > .05
```
## Is it fast?
Hell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:
<div align="center">
<img width="60%" src="figures/sampling_time.svg">
</div>
<br>
As you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:
<div align="center">
<img width="60%" src="figures/construction_time.svg">
</div>
<br>
`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:
| n | k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |
| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |
| 3 | 1 | 26ns | 2ns | 4ns | 11ns |
| 3 | 2 | 26ns | 3ns | 7ns | 13ns |
| 3 | 3 | 26ns | 3ns | 7ns | 14ns |
| 3 | 10 | 26ns | 6ns | 7ns | 27ns |
| 3 | 100 | 28ns | 47ns | 8ns | 198ns |
| 3 | 1000 | 46ns | 461ns | 19ns | 1μs, 887ns |
| 30 | 1 | 27ns | 6ns | 4ns | 69ns |
| 30 | 2 | 26ns | 7ns | 7ns | 73ns |
| 30 | 3 | 27ns | 7ns | 8ns | 72ns |
| 30 | 10 | 27ns | 14ns | 7ns | 88ns |
| 30 | 100 | 31ns | 63ns | 8ns | 256ns |
| 30 | 1000 | 67ns | 580ns | 19ns | 1μs, 935ns |
| 300 | 1 | 29ns | 47ns | 6ns | 661ns |
| 300 | 2 | 29ns | 47ns | 9ns | 659ns |
| 300 | 3 | 29ns | 49ns | 9ns | 685ns |
| 300 | 10 | 29ns | 54ns | 9ns | 685ns |
| 300 | 100 | 36ns | 112ns | 10ns | 877ns |
| 300 | 1000 | 96ns | 717ns | 20ns | 2μs, 599ns |
| 3000 | 1 | 52ns | 416ns | 18ns | 6μs, 988ns |
| 3000 | 2 | 50ns | 420ns | 21ns | 7μs, 39ns |
| 3000 | 3 | 51ns | 439ns | 21ns | 7μs, 102ns |
| 3000 | 10 | 51ns | 420ns | 21ns | 7μs, 332ns |
| 3000 | 100 | 59ns | 496ns | 23ns | 7μs, 349ns |
| 3000 | 1000 | 137ns | 1μs, 213ns | 35ns | 10μs, 190ns |
In summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.
## Development
```sh
git clone https://github.com/MaxHalford/vose
cd vose
python setup.py build_ext --inplace
pytest
```
## Further work
- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).
- I'm not 100% the memory allocation of the memoryviews is optimal.
- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.
Raw data
{
"_id": null,
"home_page": "https://github.com/MaxHalford/vose",
"name": "vose",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": null,
"keywords": null,
"author": "Max Halford",
"author_email": "maxhalford25@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/49/c8/56c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3/vose-0.0.2.tar.gz",
"platform": null,
"description": "# vose\n\nThis is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!\n\nI wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).\n\n## Installation\n\n```sh\npip install vose\n```\n\n## Usage\n\nYou first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.\n\n```py\n>>> import numpy as np\n>>> import vose\n\n>>> weights = np.array([.1, .3, .4, .2])\n>>> sampler = vose.Sampler(weights, seed=42)\n\n```\n\nYou can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.\n\n```py\n>>> sampler.sample()\n2\n\n```\n\nYou can set the `k` parameter in order to produce multiple samples.\n\n```py\n>>> sampler.sample(k=10)\narray([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])\n\n```\n\nBy default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.\n\n```py\n>>> sampler = vose.Sampler(weights, seed=42, copy=False)\n\n```\n\nNote that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.\n\n```py\n>>> weights = [.2, .3, .5]\n>>> sampler = vose.Sampler(np.array(weights))\n\n```\n\nYou can also use `vose.Sampler` from within your own Cython `.pyx` file:\n\n```py\nimport numpy as np\n\ncimport vose\ncimport numpy as np\n\ncdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)\ncdef vose.Sampler sampler\nsampler = vose.Sampler(weights)\n\ncdef int sample = sampler.sample_1()\ncdef np.int [:] samples = sampler.sample_k(10)\n```\n\nNote that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:\n\n```py\nfrom setuptools import Extension\nfrom setuptools import setup\nfrom Cython.Build import cythonize\nimport numpy as np\n\nextension = Extension(\n '*', ['your_file.pyx'],\n include_dirs=[np.get_include()],\n define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]\n)\n\nsetup(ext_modules=cythonize([extension]))\n```\n\n## Is it reliable?\n\nIt seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):\n\n```py\n>>> import numpy as np\n>>> from scipy import stats\n\n>>> rng = np.random.RandomState(42)\n>>> k = 1000\n\n>>> for n in range(3, 20):\n... weights = rng.dirichlet(np.arange(1, n))\n... sampler = vose.Sampler(weights, seed=42)\n... samples = sampler.sample(k)\n... chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)\n... assert chi2.pvalue > .05\n\n```\n\n## Is it fast?\n\nHell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:\n\n<div align=\"center\">\n <img width=\"60%\" src=\"figures/sampling_time.svg\">\n</div>\n<br>\n\nAs you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:\n\n<div align=\"center\">\n <img width=\"60%\" src=\"figures/construction_time.svg\">\n</div>\n<br>\n\n`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:\n\n| n | k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |\n| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |\n| 3 | 1 | 26ns | 2ns | 4ns | 11ns |\n| 3 | 2 | 26ns | 3ns | 7ns | 13ns |\n| 3 | 3 | 26ns | 3ns | 7ns | 14ns |\n| 3 | 10 | 26ns | 6ns | 7ns | 27ns |\n| 3 | 100 | 28ns | 47ns | 8ns | 198ns |\n| 3 | 1000 | 46ns | 461ns | 19ns | 1\u03bcs, 887ns |\n| 30 | 1 | 27ns | 6ns | 4ns | 69ns |\n| 30 | 2 | 26ns | 7ns | 7ns | 73ns |\n| 30 | 3 | 27ns | 7ns | 8ns | 72ns |\n| 30 | 10 | 27ns | 14ns | 7ns | 88ns |\n| 30 | 100 | 31ns | 63ns | 8ns | 256ns |\n| 30 | 1000 | 67ns | 580ns | 19ns | 1\u03bcs, 935ns |\n| 300 | 1 | 29ns | 47ns | 6ns | 661ns |\n| 300 | 2 | 29ns | 47ns | 9ns | 659ns |\n| 300 | 3 | 29ns | 49ns | 9ns | 685ns |\n| 300 | 10 | 29ns | 54ns | 9ns | 685ns |\n| 300 | 100 | 36ns | 112ns | 10ns | 877ns |\n| 300 | 1000 | 96ns | 717ns | 20ns | 2\u03bcs, 599ns |\n| 3000 | 1 | 52ns | 416ns | 18ns | 6\u03bcs, 988ns |\n| 3000 | 2 | 50ns | 420ns | 21ns | 7\u03bcs, 39ns |\n| 3000 | 3 | 51ns | 439ns | 21ns | 7\u03bcs, 102ns |\n| 3000 | 10 | 51ns | 420ns | 21ns | 7\u03bcs, 332ns |\n| 3000 | 100 | 59ns | 496ns | 23ns | 7\u03bcs, 349ns |\n| 3000 | 1000 | 137ns | 1\u03bcs, 213ns | 35ns | 10\u03bcs, 190ns |\n\nIn summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.\n\n## Development\n\n```sh\ngit clone https://github.com/MaxHalford/vose\ncd vose\npython setup.py build_ext --inplace\npytest\n```\n\n## Further work\n\n- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).\n- I'm not 100% the memory allocation of the memoryviews is optimal.\n- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "Cython implementation of Vose's Alias method.",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://github.com/MaxHalford/vose"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "49c856c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3",
"md5": "8a219dea1364bb1f67c3bec06ce124db",
"sha256": "46a96f7593fad7821a6fca80135791393126322bc9ac6ddd9840fd555fa2c693"
},
"downloads": -1,
"filename": "vose-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "8a219dea1364bb1f67c3bec06ce124db",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 168368,
"upload_time": "2024-08-23T08:50:51",
"upload_time_iso_8601": "2024-08-23T08:50:51.666109Z",
"url": "https://files.pythonhosted.org/packages/49/c8/56c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3/vose-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-23 08:50:51",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "MaxHalford",
"github_project": "vose",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "vose"
}