vose


Namevose JSON
Version 0.1.1 PyPI version JSON
download
home_pagehttps://github.com/MaxHalford/vose
SummaryCython implementation of Vose's Alias method.
upload_time2025-07-08 13:39:13
maintainerNone
docs_urlNone
authorMax Halford
requires_python>=3.11
licenseMIT
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # vose

This is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!

I wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).

## Installation

```sh
pip install vose
```

## Usage

You first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.

```py
>>> import numpy as np
>>> import vose

>>> weights = np.array([.1, .3, .4, .2])
>>> sampler = vose.Sampler(weights, seed=42)

```

You can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.

```py
>>> sampler.sample()
3

```

You can set the `k` parameter in order to produce multiple samples.

```py
>>> sampler.sample(k=10)
array([3, 3, 2, 1, 1, 2, 2, 3, 0, 2])

```

By default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.

```py
>>> sampler = vose.Sampler(weights, seed=42, copy=False)

```

Note that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.

```py
>>> weights = [.2, .3, .5]
>>> sampler = vose.Sampler(np.array(weights))

```

You can also use `vose.Sampler` from within your own Cython `.pyx` file:

```py
import numpy as np

cimport vose
cimport numpy as np

cdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)
cdef vose.Sampler sampler
sampler = vose.Sampler(weights)

cdef int sample = sampler.sample_1()
cdef np.int [:] samples = sampler.sample_k(10)
```

Note that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:

```py
from setuptools import Extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

extension = Extension(
    '*', ['your_file.pyx'],
    include_dirs=[np.get_include()],
    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]
)

setup(ext_modules=cythonize([extension]))
```

## Is it reliable?

It seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):

```py
>>> import numpy as np
>>> from scipy import stats

>>> rng = np.random.default_rng(seed=42)
>>> k = 1000

>>> for n in range(3, 20):
...     weights = rng.dirichlet(np.arange(1, n))
...     sampler = vose.Sampler(weights, seed=42)
...     samples = sampler.sample(k)
...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)
...     assert chi2.pvalue > .05

```

It is also reproducible:

```py
>>> import numpy as np
>>> import vose
>>> probs = np.array([0.5, 0.5])
>>> a = vose.Sampler(probs, seed=0)
>>> b = vose.Sampler(probs, seed=0)
>>> for _ in range(10000):
...     assert a.sample() == b.sample()

```

Note that the `seed` method can be used to set the state of the sampler's RNG without having to re-initialize the weights:

```py
>>> import numpy as np
>>> import vose
>>> probs = np.ones(5)
>>> a = vose.Sampler(probs, seed=3)
>>> a.sample(4)
array([2, 2, 2, 3])
>>> a.seed(3)
>>> a.sample(4)
array([2, 2, 2, 3])

```

## Is it fast?

Hell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:

<div align="center">
    <img width="60%" src="figures/sampling_time.svg">
</div>
<br>

As you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:

<div align="center">
    <img width="60%" src="figures/construction_time.svg">
</div>
<br>

`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:

|    n |    k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |
| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |
|    3 |    1 | 26ns             | 2ns            | 4ns          | 11ns                   |
|    3 |    2 | 26ns             | 3ns            | 7ns          | 13ns                   |
|    3 |    3 | 26ns             | 3ns            | 7ns          | 14ns                   |
|    3 |   10 | 26ns             | 6ns            | 7ns          | 27ns                   |
|    3 |  100 | 28ns             | 47ns           | 8ns          | 198ns                  |
|    3 | 1000 | 46ns             | 461ns          | 19ns         | 1μs, 887ns             |
|   30 |    1 | 27ns             | 6ns            | 4ns          | 69ns                   |
|   30 |    2 | 26ns             | 7ns            | 7ns          | 73ns                   |
|   30 |    3 | 27ns             | 7ns            | 8ns          | 72ns                   |
|   30 |   10 | 27ns             | 14ns           | 7ns          | 88ns                   |
|   30 |  100 | 31ns             | 63ns           | 8ns          | 256ns                  |
|   30 | 1000 | 67ns             | 580ns          | 19ns         | 1μs, 935ns             |
|  300 |    1 | 29ns             | 47ns           | 6ns          | 661ns                  |
|  300 |    2 | 29ns             | 47ns           | 9ns          | 659ns                  |
|  300 |    3 | 29ns             | 49ns           | 9ns          | 685ns                  |
|  300 |   10 | 29ns             | 54ns           | 9ns          | 685ns                  |
|  300 |  100 | 36ns             | 112ns          | 10ns         | 877ns                  |
|  300 | 1000 | 96ns             | 717ns          | 20ns         | 2μs, 599ns             |
| 3000 |    1 | 52ns             | 416ns          | 18ns         | 6μs, 988ns             |
| 3000 |    2 | 50ns             | 420ns          | 21ns         | 7μs, 39ns              |
| 3000 |    3 | 51ns             | 439ns          | 21ns         | 7μs, 102ns             |
| 3000 |   10 | 51ns             | 420ns          | 21ns         | 7μs, 332ns             |
| 3000 |  100 | 59ns             | 496ns          | 23ns         | 7μs, 349ns             |
| 3000 | 1000 | 137ns            | 1μs, 213ns     | 35ns         | 10μs, 190ns            |

In summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.

## Development

```sh
git clone https://github.com/MaxHalford/vose
cd vose
python setup.py build_ext --inplace
pytest
```

## Further work

- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).
- I'm not 100% the memory allocation of the memoryviews is optimal.
- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/MaxHalford/vose",
    "name": "vose",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": null,
    "author": "Max Halford",
    "author_email": "maxhalford25@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/ca/9f/31fc07bcf482ce29f3e52c6733f91990267f42429b178284c0b620d05178/vose-0.1.1.tar.gz",
    "platform": null,
    "description": "# vose\n\nThis is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!\n\nI wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).\n\n## Installation\n\n```sh\npip install vose\n```\n\n## Usage\n\nYou first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.\n\n```py\n>>> import numpy as np\n>>> import vose\n\n>>> weights = np.array([.1, .3, .4, .2])\n>>> sampler = vose.Sampler(weights, seed=42)\n\n```\n\nYou can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.\n\n```py\n>>> sampler.sample()\n3\n\n```\n\nYou can set the `k` parameter in order to produce multiple samples.\n\n```py\n>>> sampler.sample(k=10)\narray([3, 3, 2, 1, 1, 2, 2, 3, 0, 2])\n\n```\n\nBy default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.\n\n```py\n>>> sampler = vose.Sampler(weights, seed=42, copy=False)\n\n```\n\nNote that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.\n\n```py\n>>> weights = [.2, .3, .5]\n>>> sampler = vose.Sampler(np.array(weights))\n\n```\n\nYou can also use `vose.Sampler` from within your own Cython `.pyx` file:\n\n```py\nimport numpy as np\n\ncimport vose\ncimport numpy as np\n\ncdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)\ncdef vose.Sampler sampler\nsampler = vose.Sampler(weights)\n\ncdef int sample = sampler.sample_1()\ncdef np.int [:] samples = sampler.sample_k(10)\n```\n\nNote that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:\n\n```py\nfrom setuptools import Extension\nfrom setuptools import setup\nfrom Cython.Build import cythonize\nimport numpy as np\n\nextension = Extension(\n    '*', ['your_file.pyx'],\n    include_dirs=[np.get_include()],\n    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]\n)\n\nsetup(ext_modules=cythonize([extension]))\n```\n\n## Is it reliable?\n\nIt seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):\n\n```py\n>>> import numpy as np\n>>> from scipy import stats\n\n>>> rng = np.random.default_rng(seed=42)\n>>> k = 1000\n\n>>> for n in range(3, 20):\n...     weights = rng.dirichlet(np.arange(1, n))\n...     sampler = vose.Sampler(weights, seed=42)\n...     samples = sampler.sample(k)\n...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)\n...     assert chi2.pvalue > .05\n\n```\n\nIt is also reproducible:\n\n```py\n>>> import numpy as np\n>>> import vose\n>>> probs = np.array([0.5, 0.5])\n>>> a = vose.Sampler(probs, seed=0)\n>>> b = vose.Sampler(probs, seed=0)\n>>> for _ in range(10000):\n...     assert a.sample() == b.sample()\n\n```\n\nNote that the `seed` method can be used to set the state of the sampler's RNG without having to re-initialize the weights:\n\n```py\n>>> import numpy as np\n>>> import vose\n>>> probs = np.ones(5)\n>>> a = vose.Sampler(probs, seed=3)\n>>> a.sample(4)\narray([2, 2, 2, 3])\n>>> a.seed(3)\n>>> a.sample(4)\narray([2, 2, 2, 3])\n\n```\n\n## Is it fast?\n\nHell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:\n\n<div align=\"center\">\n    <img width=\"60%\" src=\"figures/sampling_time.svg\">\n</div>\n<br>\n\nAs you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:\n\n<div align=\"center\">\n    <img width=\"60%\" src=\"figures/construction_time.svg\">\n</div>\n<br>\n\n`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:\n\n|    n |    k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |\n| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |\n|    3 |    1 | 26ns             | 2ns            | 4ns          | 11ns                   |\n|    3 |    2 | 26ns             | 3ns            | 7ns          | 13ns                   |\n|    3 |    3 | 26ns             | 3ns            | 7ns          | 14ns                   |\n|    3 |   10 | 26ns             | 6ns            | 7ns          | 27ns                   |\n|    3 |  100 | 28ns             | 47ns           | 8ns          | 198ns                  |\n|    3 | 1000 | 46ns             | 461ns          | 19ns         | 1\u03bcs, 887ns             |\n|   30 |    1 | 27ns             | 6ns            | 4ns          | 69ns                   |\n|   30 |    2 | 26ns             | 7ns            | 7ns          | 73ns                   |\n|   30 |    3 | 27ns             | 7ns            | 8ns          | 72ns                   |\n|   30 |   10 | 27ns             | 14ns           | 7ns          | 88ns                   |\n|   30 |  100 | 31ns             | 63ns           | 8ns          | 256ns                  |\n|   30 | 1000 | 67ns             | 580ns          | 19ns         | 1\u03bcs, 935ns             |\n|  300 |    1 | 29ns             | 47ns           | 6ns          | 661ns                  |\n|  300 |    2 | 29ns             | 47ns           | 9ns          | 659ns                  |\n|  300 |    3 | 29ns             | 49ns           | 9ns          | 685ns                  |\n|  300 |   10 | 29ns             | 54ns           | 9ns          | 685ns                  |\n|  300 |  100 | 36ns             | 112ns          | 10ns         | 877ns                  |\n|  300 | 1000 | 96ns             | 717ns          | 20ns         | 2\u03bcs, 599ns             |\n| 3000 |    1 | 52ns             | 416ns          | 18ns         | 6\u03bcs, 988ns             |\n| 3000 |    2 | 50ns             | 420ns          | 21ns         | 7\u03bcs, 39ns              |\n| 3000 |    3 | 51ns             | 439ns          | 21ns         | 7\u03bcs, 102ns             |\n| 3000 |   10 | 51ns             | 420ns          | 21ns         | 7\u03bcs, 332ns             |\n| 3000 |  100 | 59ns             | 496ns          | 23ns         | 7\u03bcs, 349ns             |\n| 3000 | 1000 | 137ns            | 1\u03bcs, 213ns     | 35ns         | 10\u03bcs, 190ns            |\n\nIn summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.\n\n## Development\n\n```sh\ngit clone https://github.com/MaxHalford/vose\ncd vose\npython setup.py build_ext --inplace\npytest\n```\n\n## Further work\n\n- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).\n- I'm not 100% the memory allocation of the memoryviews is optimal.\n- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Cython implementation of Vose's Alias method.",
    "version": "0.1.1",
    "project_urls": {
        "Homepage": "https://github.com/MaxHalford/vose"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "dbdd685a2e41c33e7fba8d8c723bed3fc87722f16ddf2551bc67bd52deb1d109",
                "md5": "be9fe786a1b9650630985eee75461aae",
                "sha256": "c16cddaaf7bd7c0827b89eb3fbf3608e02444dc3921be0658efb81a0a4c6d3c8"
            },
            "downloads": -1,
            "filename": "vose-0.1.1-cp311-cp311-macosx_15_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "be9fe786a1b9650630985eee75461aae",
            "packagetype": "bdist_wheel",
            "python_version": "cp311",
            "requires_python": ">=3.11",
            "size": 7642,
            "upload_time": "2025-07-08T13:39:12",
            "upload_time_iso_8601": "2025-07-08T13:39:12.212394Z",
            "url": "https://files.pythonhosted.org/packages/db/dd/685a2e41c33e7fba8d8c723bed3fc87722f16ddf2551bc67bd52deb1d109/vose-0.1.1-cp311-cp311-macosx_15_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "ca9f31fc07bcf482ce29f3e52c6733f91990267f42429b178284c0b620d05178",
                "md5": "fbfd16e56d3554ea25a568c3313cad1e",
                "sha256": "92cf9fa22e7f1e2926c67f3c6f35e89eac35f5552eaeeda881e61a579e87fdf9"
            },
            "downloads": -1,
            "filename": "vose-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "fbfd16e56d3554ea25a568c3313cad1e",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 6642,
            "upload_time": "2025-07-08T13:39:13",
            "upload_time_iso_8601": "2025-07-08T13:39:13.468563Z",
            "url": "https://files.pythonhosted.org/packages/ca/9f/31fc07bcf482ce29f3e52c6733f91990267f42429b178284c0b620d05178/vose-0.1.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-08 13:39:13",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MaxHalford",
    "github_project": "vose",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "vose"
}
        
Elapsed time: 0.87992s