# Buffalo Gym
A multi-armed bandit (MAB) environment for the gymnasium API.
One-armed Bandit is a reference to slot machines, and Buffalo
is a reference to one such slot machine that I am fond
of. MABs are an excellent playground for theoretical exercise and
debugging of RL agents as they provide an environment that
can be reasoned about easily. It helped me once to step back
and write an MAB to debug my DQN agent. But there was a lack
of native gymnasium environments, so I wrote Buffalo, an easy-to-use
environment, in the hope that it might help someone else.
## Standard Bandit Problems
### Buffalo ("Buffalo-v0" | "Bandit-v0")
Default multi-armed bandit environment. Arm center values
are drawn from a normal distribution (0, arms). When an
arm is pulled, a random value is drawn from a normal
distribution (0, 1) and added to the chosen arm center
value. This is not intended to be challenging for an agent,
but to be easy to reason about while debugging one.
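A minimal interaction sketch, assuming the environment follows the
standard gymnasium reset/step API:

```
import gymnasium as gym
import buffalo_gym  # importing registers the Buffalo environments

env = gym.make("Buffalo-v0")
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()  # pull a random arm
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```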
### Multi-Buffalo ("MultiBuffalo-v0" | "ContextualBandit-v0")
This serves as a contextual bandit implementation. It is a
k-armed bandit with n states. The current state is indicated to
the agent in the observation, and each state has different
reward offsets for each arm. The goal of the agent is to
learn the best action for each state. This is
a good stepping stone to Markov Decision Processes.
This module has an extra parameter, pace. By default (None), a
new state is chosen on every step of the environment. It can
be set to any integer to control how many steps elapse between
randomly drawing a new state. Of course, transitioning to a new
state is not guaranteed, as the next state is chosen at random.
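A short sketch of the pace option; passing it as a keyword argument
to gym.make is an assumption:

```
import gymnasium as gym
import buffalo_gym

# Assumption: pace is accepted as a gym.make keyword argument.
env = gym.make("MultiBuffalo-v0", pace=5)  # redraw the state every 5 steps
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```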
### DuelingBuffalo ("DuelingBuffalo-v0" | "DuelingBandit-v0")
Yue et al. (2012) introduced the dueling bandit variant to model
situations with only relative feedback. The agent pulls two levers
simultaneously; the feedback is which lever provided the better
reward. This restriction means the agent cannot observe raw rewards
and must continually compare arms to determine the best one. Given
the reward-centric structure of gymnasium returns, we instead
give a reward of 1 if the first chosen arm paid out more than the
second. The agent must choose two arms, and they cannot be the same.
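A minimal sketch; treating the action as a pair of distinct arm
indices is an assumption drawn from the description above:

```
import gymnasium as gym
import buffalo_gym

env = gym.make("DuelingBuffalo-v0")
obs, info = env.reset()
action = env.action_space.sample()  # assumed: two distinct arm indices
obs, reward, terminated, truncated, info = env.step(action)
# reward == 1 when the first chosen arm out-pulled the second
```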
### BoundlessBuffalo ("BoundlessBuffalo-v0" | "InfiniteArmedBandit-v0")
Built from the Wikipedia entry based on Agrawal, 1995 (paywalled),
BoundlessBuffalo approximates the infinite-armed bandit problem.
The reward is the chosen action evaluated by a polynomial of
degree n whose coefficients are randomly sampled from (-0.1, 0.1).
This environment tests an algorithm's ability to find an optimal
input in a continuous space. The dynamic drawing of new coefficients
challenges algorithms to continually adapt to a changing landscape.
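To illustrate the reward structure described above (a sketch only;
the degree and evaluation details here are assumptions):

```
import numpy as np

# Illustrative only: a degree-3 polynomial with coefficients drawn
# from (-0.1, 0.1), evaluated at the chosen continuous action.
rng = np.random.default_rng(0)
coeffs = rng.uniform(-0.1, 0.1, size=4)  # degree n = 3
action = 0.5
reward = np.polyval(coeffs, action)
print(reward)
```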
## Nonstandard Bandit Problems
### Buffalo Trail ("BuffaloTrail-v0" | "StatefulBandit-v0")
A Stateful Bandit builds on the Contextual Bandit by relaxing
the assumption that rewards depend only on the current state.
In this framework, the environment incorporates a memory of past
states, granting the maximum reward only when a specific sequence
of states has occurred and the agent selects the correct action.
This setup isolates an agent's ability to track history and infer
belief states, without introducing the confounding factor of
exploration, as the agent cannot control state transitions. Stateful
Bandits provide a targeted environment for studying history-dependent
decision-making and state estimation.
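A minimal sketch of giving a memoryless agent a short observation
history, assuming the standard gymnasium API; the window length is
arbitrary:

```
from collections import deque

import gymnasium as gym
import buffalo_gym

env = gym.make("BuffaloTrail-v0")
obs, info = env.reset()
history = deque([obs] * 3, maxlen=3)  # last three observations
for _ in range(10):
    action = env.action_space.sample()  # an agent would act on list(history)
    obs, reward, terminated, truncated, info = env.step(action)
    history.append(obs)
```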
### Symbolic State ("SymbolicStateBandit-v0")
In real slots, the state of the bandit has little to no impact on
the underlying rewards. Plenty of flashing lights and game modes
serve only to keep the player engaged. This SymbolicStateBandit
(SSB) formulation simulates this. The states do not correlate
with the underlying rewards in this contextual bandit.
When dynamic_rate is None, the rewards stay the same even as the
state changes; setting dynamic_rate equal to pace randomly redraws
the arms with every state change, and any other value produces
further uncorrelated behavior. This environment serves as a test bed
for the "worst case" scenario for a bandit/reinforcement learner. It
measures how well an agent generalizes and how it performs when the
environment breaks its typical assumptions.
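A sketch of the configurations described above; passing pace and
dynamic_rate through gym.make is an assumption:

```
import gymnasium as gym
import buffalo_gym

# Assumption: pace and dynamic_rate are gym.make keyword arguments.
static = gym.make("SymbolicStateBandit-v0", dynamic_rate=None)  # rewards never change
shifting = gym.make("SymbolicStateBandit-v0", pace=5, dynamic_rate=5)  # arms redrawn with each state change
```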
## Using
Install via pip (`pip install buffalo-gym`) and import buffalo_gym along with gymnasium.
```
import gymnasium as gym
import buffalo_gym

env = gym.make("Buffalo-v0")
```