vose


Namevose JSON
Version 0.0.2 PyPI version JSON
download
home_pagehttps://github.com/MaxHalford/vose
SummaryCython implementation of Vose's Alias method.
upload_time2024-08-23 08:50:51
maintainerNone
docs_urlNone
authorMax Halford
requires_python>=3.10
licenseMIT
keywords
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # vose

This is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!

I wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).

## Installation

```sh
pip install vose
```

## Usage

You first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.

```py
>>> import numpy as np
>>> import vose

>>> weights = np.array([.1, .3, .4, .2])
>>> sampler = vose.Sampler(weights, seed=42)

```

You can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.

```py
>>> sampler.sample()
2

```

You can set the `k` parameter in order to produce multiple samples.

```py
>>> sampler.sample(k=10)
array([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])

```

By default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.

```py
>>> sampler = vose.Sampler(weights, seed=42, copy=False)

```

Note that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.

```py
>>> weights = [.2, .3, .5]
>>> sampler = vose.Sampler(np.array(weights))

```

You can also use `vose.Sampler` from within your own Cython `.pyx` file:

```py
import numpy as np

cimport vose
cimport numpy as np

cdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)
cdef vose.Sampler sampler
sampler = vose.Sampler(weights)

cdef int sample = sampler.sample_1()
cdef np.int [:] samples = sampler.sample_k(10)
```

Note that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:

```py
from setuptools import Extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

extension = Extension(
    '*', ['your_file.pyx'],
    include_dirs=[np.get_include()],
    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]
)

setup(ext_modules=cythonize([extension]))
```

## Is it reliable?

It seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):

```py
>>> import numpy as np
>>> from scipy import stats

>>> rng = np.random.RandomState(42)
>>> k = 1000

>>> for n in range(3, 20):
...     weights = rng.dirichlet(np.arange(1, n))
...     sampler = vose.Sampler(weights, seed=42)
...     samples = sampler.sample(k)
...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)
...     assert chi2.pvalue > .05

```

## Is it fast?

Hell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:

<div align="center">
    <img width="60%" src="figures/sampling_time.svg">
</div>
<br>

As you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:

<div align="center">
    <img width="60%" src="figures/construction_time.svg">
</div>
<br>

`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:

|    n |    k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |
| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |
|    3 |    1 | 26ns             | 2ns            | 4ns          | 11ns                   |
|    3 |    2 | 26ns             | 3ns            | 7ns          | 13ns                   |
|    3 |    3 | 26ns             | 3ns            | 7ns          | 14ns                   |
|    3 |   10 | 26ns             | 6ns            | 7ns          | 27ns                   |
|    3 |  100 | 28ns             | 47ns           | 8ns          | 198ns                  |
|    3 | 1000 | 46ns             | 461ns          | 19ns         | 1μs, 887ns             |
|   30 |    1 | 27ns             | 6ns            | 4ns          | 69ns                   |
|   30 |    2 | 26ns             | 7ns            | 7ns          | 73ns                   |
|   30 |    3 | 27ns             | 7ns            | 8ns          | 72ns                   |
|   30 |   10 | 27ns             | 14ns           | 7ns          | 88ns                   |
|   30 |  100 | 31ns             | 63ns           | 8ns          | 256ns                  |
|   30 | 1000 | 67ns             | 580ns          | 19ns         | 1μs, 935ns             |
|  300 |    1 | 29ns             | 47ns           | 6ns          | 661ns                  |
|  300 |    2 | 29ns             | 47ns           | 9ns          | 659ns                  |
|  300 |    3 | 29ns             | 49ns           | 9ns          | 685ns                  |
|  300 |   10 | 29ns             | 54ns           | 9ns          | 685ns                  |
|  300 |  100 | 36ns             | 112ns          | 10ns         | 877ns                  |
|  300 | 1000 | 96ns             | 717ns          | 20ns         | 2μs, 599ns             |
| 3000 |    1 | 52ns             | 416ns          | 18ns         | 6μs, 988ns             |
| 3000 |    2 | 50ns             | 420ns          | 21ns         | 7μs, 39ns              |
| 3000 |    3 | 51ns             | 439ns          | 21ns         | 7μs, 102ns             |
| 3000 |   10 | 51ns             | 420ns          | 21ns         | 7μs, 332ns             |
| 3000 |  100 | 59ns             | 496ns          | 23ns         | 7μs, 349ns             |
| 3000 | 1000 | 137ns            | 1μs, 213ns     | 35ns         | 10μs, 190ns            |

In summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.

## Development

```sh
git clone https://github.com/MaxHalford/vose
cd vose
python setup.py build_ext --inplace
pytest
```

## Further work

- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).
- I'm not 100% the memory allocation of the memoryviews is optimal.
- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/MaxHalford/vose",
    "name": "vose",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": null,
    "keywords": null,
    "author": "Max Halford",
    "author_email": "maxhalford25@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/49/c8/56c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3/vose-0.0.2.tar.gz",
    "platform": null,
    "description": "# vose\n\nThis is a Cython implemention of Michael Vose's [Alias method](https://www.wikiwand.com/en/Alias_method). It can be used to perform weighted sampling with replacement of integers in `O(1)` time. It requires a construction phase that runs in `O(n)` time, with `n` being the number of integers with associated weights. As far as I know, it's faster than any other method available in Python. But I would love to be proven wrong!\n\nI wrote this because I had a specific usecase where I needed to repeatidly sample integers with a weight associated to each integer. I stumbled on Keith Schwarz's [_Darts, Dice, and Coins: Sampling from a Discrete Distribution_](https://www.keithschwarz.com/darts-dice-coins/), which is very well written, and decided to use the Alias method. Alas, `numpy` doesn't seem to have it available, and neither does the `random` module from Python's standard library. There is, however, the [`vose_sampler`](https://github.com/asmith26/Vose-Alias-Method) package, but it is written in pure Python and isn't fast enough for my purposes. I therefore decided to write it in Cython and shamelessly adapted Keith Schmarz's [Java implementation](https://www.keithschwarz.com/interesting/code/?dir=alias-method).\n\n## Installation\n\n```sh\npip install vose\n```\n\n## Usage\n\nYou first have to initialize a sampler with an array of weights. These weights are not required to sum up to 1.\n\n```py\n>>> import numpy as np\n>>> import vose\n\n>>> weights = np.array([.1, .3, .4, .2])\n>>> sampler = vose.Sampler(weights, seed=42)\n\n```\n\nYou can then call the `.sample()` method to sample a random integer in range `[0, n - 1]`, where `n` is the number of weights that were passed.\n\n```py\n>>> sampler.sample()\n2\n\n```\n\nYou can set the `k` parameter in order to produce multiple samples.\n\n```py\n>>> sampler.sample(k=10)\narray([1, 1, 2, 1, 2, 1, 0, 1, 3, 3])\n\n```\n\nBy default, a copy of the weights is made. You can disable this behavior in order to save a few microseconds, but this will modify the provided array.\n\n```py\n>>> sampler = vose.Sampler(weights, seed=42, copy=False)\n\n```\n\nNote that `vose.Sampler` expects to be provided with a [memoryview](https://docs.python.org/3/c-api/memoryview.html). In order to pass a list, you have to convert it yourself to a numpy array.\n\n```py\n>>> weights = [.2, .3, .5]\n>>> sampler = vose.Sampler(np.array(weights))\n\n```\n\nYou can also use `vose.Sampler` from within your own Cython `.pyx` file:\n\n```py\nimport numpy as np\n\ncimport vose\ncimport numpy as np\n\ncdef np.float [:] weights = np.array([.2, .3, .5], dtype=float)\ncdef vose.Sampler sampler\nsampler = vose.Sampler(weights)\n\ncdef int sample = sampler.sample_1()\ncdef np.int [:] samples = sampler.sample_k(10)\n```\n\nNote that the latter requires having to include the `numpy` headers in the extension definition of your `setup.py`:\n\n```py\nfrom setuptools import Extension\nfrom setuptools import setup\nfrom Cython.Build import cythonize\nimport numpy as np\n\nextension = Extension(\n    '*', ['your_file.pyx'],\n    include_dirs=[np.get_include()],\n    define_macros=[('NPY_NO_DEPRECATED_API', 'NPY_1_7_API_VERSION')]\n)\n\nsetup(ext_modules=cythonize([extension]))\n```\n\n## Is it reliable?\n\nIt seems to be working correctly; at least according to the following [chi-squared tests](https://www.wikiwand.com/en/Chi-squared_test):\n\n```py\n>>> import numpy as np\n>>> from scipy import stats\n\n>>> rng = np.random.RandomState(42)\n>>> k = 1000\n\n>>> for n in range(3, 20):\n...     weights = rng.dirichlet(np.arange(1, n))\n...     sampler = vose.Sampler(weights, seed=42)\n...     samples = sampler.sample(k)\n...     chi2 = stats.chisquare(f_obs=np.bincount(samples), f_exp=weights * k)\n...     assert chi2.pvalue > .05\n\n```\n\n## Is it fast?\n\nHell yeah. The following graph shows the average time taken to sample one integer for different amounts of weights:\n\n<div align=\"center\">\n    <img width=\"60%\" src=\"figures/sampling_time.svg\">\n</div>\n<br>\n\nAs you can see, `vose.Sampler` takes less than a nanosecond to produce a random integer. Here is the construction time:\n\n<div align=\"center\">\n    <img width=\"60%\" src=\"figures/construction_time.svg\">\n</div>\n<br>\n\n`vose.Sampler` is also fast enough to compete with `numpy` and `random`, even when including the construction time. The following table summarizes a comparison I made on my laptop, with `n` being the number of weights and `k` the number of integers to sample:\n\n|    n |    k | np.random.choice | random.choices | vose.Sampler | vose_sampler.VoseAlias |\n| ---: | ---: | :--------------- | :------------- | :----------- | :--------------------- |\n|    3 |    1 | 26ns             | 2ns            | 4ns          | 11ns                   |\n|    3 |    2 | 26ns             | 3ns            | 7ns          | 13ns                   |\n|    3 |    3 | 26ns             | 3ns            | 7ns          | 14ns                   |\n|    3 |   10 | 26ns             | 6ns            | 7ns          | 27ns                   |\n|    3 |  100 | 28ns             | 47ns           | 8ns          | 198ns                  |\n|    3 | 1000 | 46ns             | 461ns          | 19ns         | 1\u03bcs, 887ns             |\n|   30 |    1 | 27ns             | 6ns            | 4ns          | 69ns                   |\n|   30 |    2 | 26ns             | 7ns            | 7ns          | 73ns                   |\n|   30 |    3 | 27ns             | 7ns            | 8ns          | 72ns                   |\n|   30 |   10 | 27ns             | 14ns           | 7ns          | 88ns                   |\n|   30 |  100 | 31ns             | 63ns           | 8ns          | 256ns                  |\n|   30 | 1000 | 67ns             | 580ns          | 19ns         | 1\u03bcs, 935ns             |\n|  300 |    1 | 29ns             | 47ns           | 6ns          | 661ns                  |\n|  300 |    2 | 29ns             | 47ns           | 9ns          | 659ns                  |\n|  300 |    3 | 29ns             | 49ns           | 9ns          | 685ns                  |\n|  300 |   10 | 29ns             | 54ns           | 9ns          | 685ns                  |\n|  300 |  100 | 36ns             | 112ns          | 10ns         | 877ns                  |\n|  300 | 1000 | 96ns             | 717ns          | 20ns         | 2\u03bcs, 599ns             |\n| 3000 |    1 | 52ns             | 416ns          | 18ns         | 6\u03bcs, 988ns             |\n| 3000 |    2 | 50ns             | 420ns          | 21ns         | 7\u03bcs, 39ns              |\n| 3000 |    3 | 51ns             | 439ns          | 21ns         | 7\u03bcs, 102ns             |\n| 3000 |   10 | 51ns             | 420ns          | 21ns         | 7\u03bcs, 332ns             |\n| 3000 |  100 | 59ns             | 496ns          | 23ns         | 7\u03bcs, 349ns             |\n| 3000 | 1000 | 137ns            | 1\u03bcs, 213ns     | 35ns         | 10\u03bcs, 190ns            |\n\nIn summary, you probably don't need to be using `vose.Sampler` if you only need to sample once, regardless of the number of integers you wish to sample. You want to use `vose.Sampler` when you need to sample in a sequential manner, because at that point the construction time will be amortized. Indeed, this will bring you two orders of magnitude improved speed, when compared to calling `np.random.choice` or `random.choices` multiple times.\n\n## Development\n\n```sh\ngit clone https://github.com/MaxHalford/vose\ncd vose\npython setup.py build_ext --inplace\npytest\n```\n\n## Further work\n\n- The weights assigned to each integer cannot be modified. A [search tree](https://www.wikiwand.com/en/Search_tree) can be used as a data structure that supports modifications. This allows modifying weights in `O(log(n))` time, but means sampling also happens in `O(log(n))` time. More information [here](https://stackoverflow.com/questions/34247459/an-efficient-version-alternative-to-the-alias-method-that-samples-without-replac).\n- I'm not 100% the memory allocation of the memoryviews is optimal.\n- Initializing a `vose.Sampler` from another Cython `.pyx` file seems to require some Python interaction; there's probably a way to avoid this.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Cython implementation of Vose's Alias method.",
    "version": "0.0.2",
    "project_urls": {
        "Homepage": "https://github.com/MaxHalford/vose"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "49c856c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3",
                "md5": "8a219dea1364bb1f67c3bec06ce124db",
                "sha256": "46a96f7593fad7821a6fca80135791393126322bc9ac6ddd9840fd555fa2c693"
            },
            "downloads": -1,
            "filename": "vose-0.0.2.tar.gz",
            "has_sig": false,
            "md5_digest": "8a219dea1364bb1f67c3bec06ce124db",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 168368,
            "upload_time": "2024-08-23T08:50:51",
            "upload_time_iso_8601": "2024-08-23T08:50:51.666109Z",
            "url": "https://files.pythonhosted.org/packages/49/c8/56c6a060de3a2d0dd42f2b9d1505108e9ef7d4cccb0df93d3dedfbebedf3/vose-0.0.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-23 08:50:51",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "MaxHalford",
    "github_project": "vose",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "vose"
}
        
Elapsed time: 0.34249s