# turingpoint
Turing point is a Reinforcement Learning (RL) library. It adds the missing duct tape.
It supports multiple (heterogeneous) agents seamlessly. Per-agent partial observation is natural with Turing point.
Different agents can act at different frequencies.
You may continue to use the environment and agent libraries that you already use, such as Gym/Gymnasium, Stable-Baselines3, Tianshou, RLlib, etc.
Turing point integrates easily with existing RL libraries and your own custom code.
Integrating RL agents into target applications should be significantly easier with Turing point.

The main concept in Turing point is that there are multiple participants and each gets its turn.
The participants communicate through a parcel that is passed among them. The agent and the environment are both participants in that sense. No more confusion about which of those should call which. Reward logic, for example,
can be addressed wherever you believe it is most suitable.
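
To make the turn-taking idea concrete, here is a minimal, library-free sketch. It is not Turing point's implementation: `run_turns` and `Done` are hypothetical stand-ins invented for this illustration, and the toy "environment" is just a counter. The only assumption carried over from the text is that a participant is a callable that reads from and writes to a shared parcel (a `dict`).

```python
import itertools
from typing import Callable, Iterable

Parcel = dict
Participant = Callable[[Parcel], None]


class Done(Exception):
    """Hypothetical signal: a participant raises it to end the episode."""


def run_turns(participants: Iterable[Participant]) -> Parcel:
    """Give each participant its turn, passing the same parcel along."""
    parcel: Parcel = {}
    try:
        for participant in participants:
            participant(parcel)
    except Done:
        pass
    return parcel


def reset(parcel: Parcel) -> None:
    parcel["obs"] = 0
    parcel["steps"] = 0


def agent(parcel: Parcel) -> None:
    # A trivial policy: always choose action 1.
    parcel["action"] = 1


def environment(parcel: Parcel) -> None:
    # The toy state is a counter advanced by the action.
    parcel["obs"] += parcel["action"]
    parcel["steps"] += 1


def reward_logic(parcel: Parcel) -> None:
    # The reward lives in its own participant, not in the agent or the environment.
    parcel["reward"] = 1.0 if parcel["obs"] % 2 == 0 else 0.0


def check_done(parcel: Parcel) -> None:
    if parcel["steps"] >= 5:
        raise Done


def get_participants():
    yield reset
    yield from itertools.cycle([agent, environment, reward_logic, check_done])


final_parcel = run_turns(get_participants())
print(final_parcel)
```

Neither `agent` nor `environment` calls the other; the order of turns is decided only by `get_participants`.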
Turing point may be helpful with parallel or distributed training, yet it does not address those explicitly. On the contrary, with Turing point the flow is sequential among the participants. As far as we can tell, Turing point at least does not hinder the use of parallel and/or distributed training.

Participants can be added and/or removed dynamically (e.g. a new "monster" enters and later "disappears").
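
One way to picture different frequencies and dynamic participants is a generator that decides, round by round, who gets a turn. The sketch below is plain Python and does not use Turing point itself; the names `fast_agent`, `slow_agent`, `monster`, and the round numbers are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterator

Parcel = dict
Participant = Callable[[Parcel], None]


def fast_agent(parcel: Parcel) -> None:
    parcel["log"].append("fast")


def slow_agent(parcel: Parcel) -> None:
    parcel["log"].append("slow")


def monster(parcel: Parcel) -> None:
    parcel["log"].append("monster")


def get_participants(num_rounds: int) -> Iterator[Participant]:
    for round_idx in range(num_rounds):
        # Acts on every round.
        yield fast_agent
        # Acts on every third round only.
        if round_idx % 3 == 0:
            yield slow_agent
        # Enters at round 2 and disappears after round 3.
        if 2 <= round_idx < 4:
            yield monster


parcel: Parcel = {"log": []}
for participant in get_participants(6):
    participant(parcel)

print(parcel["log"])
```

Because the sequence of participants is produced lazily, the decision of who acts next can also depend on state known only at run time.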
Consider a Gym/SB3 training realm:
```python
import gym

from stable_baselines3 import A2C

# Create the specific Gym environment.
env = gym.make("CartPole-v1")

# An agent is created and injected with the environment.
# The agent probably makes a copy of the passed environment, wraps it, etc.
model = A2C("MlpPolicy", env, verbose=1)

# The agent is trained against its environment.
# We can assume what is happening there (obs, action, reward, obs, ..), yet it is not explicit.
model.learn(total_timesteps=10_000)

# We now evaluate the performance of our agent with the help of the environment that the agent maintains.
vec_env = model.get_env()
obs = vec_env.reset()
for i in range(1000):
    # The parameter for predict is the observation,
    # which is good, as our application (e.g. an actual cartpole robot) can indeed provide such observations and use the returned action.
    # Note: the action space as well as the observation space are defined in the environment.
    # Also note: the environment is aware of the agent. This is how the environment was designed.
    # The action space of the agent is coded in the environment.
    # The observation space is intended for the agent and probably also reflects what the agent should know about itself.
    # The _state output is related to RNNs, AFAIK.
    action, _state = model.predict(obs, deterministic=True)
    # Here the reward, done, and info outputs are just for our evaluation.
    # Mainly what is happening here is that the environment moves to a new state.
    # The reward and the done flag are specific to the agent.
    # If there are other entities in the environment, those may continue to live after done=True and may not care (directly) about this specific reward.
    obs, reward, done, info = vec_env.step(action)
    # We render here. We did not render during the training (learn), which probably makes sense performance-wise.
    vec_env.render()
    # VecEnv resets automatically
    # if done:
    #   obs = vec_env.reset()

# Observation: we reset the environment. The model is supposed to be memory-less (MDP assumption).
```
In the comments above, we've tried to give the intuition for why some additional thinking is needed about
the software that is used to provision those environment / agent(s) realms.

Let's see how the above can be described with Turing point:
```python
...
import turingpoint.gymnasium_utils as tp_gym_utils
import turingpoint.sb3_utils as tp_sb3_utils
import turingpoint.utils as tp_utils
import turingpoint as tp


def evaluate(env, agent, num_episodes: int) -> float:

    rewards_collector = tp_utils.Collector(['reward'])

    def get_participants():
        yield functools.partial(tp_gym_utils.call_reset, env=env)
        yield from itertools.cycle([
            functools.partial(tp_sb3_utils.call_predict, agent=agent, deterministic=True),
            functools.partial(tp_gym_utils.call_step, env=env),
            rewards_collector,
            tp_gym_utils.check_done
        ])

    evaluate_assembly = tp.Assembly(get_participants)

    for _ in range(num_episodes):
        _ = evaluate_assembly.launch()
        # Note that we don't clear the rewards in 'rewards_collector', and so we continue to collect.

    total_reward = sum(x['reward'] for x in rewards_collector.get_entries())

    return total_reward / num_episodes

..

def main():

    random.seed(1)
    np.random.seed(1)
    torch.manual_seed(1)

    env = gym.make('CartPole-v1')

    env.reset(seed=1)

    agent = PPO(MlpPolicy, env, verbose=0) # use verbose=1 for debugging

    mean_reward_before_train = evaluate(env, agent, 100)
    print("before training")
    print(f'{mean_reward_before_train=}')

..
```
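
The averaging in `evaluate` relies on the collector not being cleared between episodes. The sketch below illustrates that arithmetic with a hypothetical `SimpleCollector`; it is a stand-in written for this example and is not the code of `tp_utils.Collector`. The reward values are made up.

```python
class SimpleCollector:
    """Hypothetical participant that records selected parcel keys on each turn."""

    def __init__(self, keys):
        self.keys = list(keys)
        self.entries = []

    def __call__(self, parcel: dict) -> None:
        self.entries.append({key: parcel[key] for key in self.keys})


collector = SimpleCollector(['reward'])

# Two made-up episodes: the first lasts three steps, the second one step.
episodes = [[1.0, 1.0, 1.0], [1.0]]
for episode_rewards in episodes:
    for reward in episode_rewards:
        # Only the 'reward' key is recorded; other parcel content is ignored.
        collector({'reward': reward, 'obs': None})

# The entries span all episodes, so dividing the total by the
# number of episodes gives the mean return per episode.
mean_reward = sum(entry['reward'] for entry in collector.entries) / len(episodes)
print(mean_reward)
```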
What did we gain, and was it worth the extra coding? Let's add a second agent to the environment: wind. Or maybe it is part of the augmented environment; it does not really matter. Let's just add it.
```python
..

def wind(parcel: dict) -> None:
    action_wind = "blow left" if random() < 0.5 else "blow right"
    parcel['action_wind'] = action_wind


def wind_impact(parcel: dict) -> None:
    action_wind = parcel['action_wind']
    # We'll modify the action of the agent, given the wind,
    # as we don't have here access to the state of the environment.
    parcel['action'] = ...


def evaluate(env, agent, num_episodes: int) -> float:

    rewards_collector = tp_utils.Collector(['reward'])

    def get_participants():
        yield functools.partial(tp_gym_utils.call_reset, env=env)
        yield from itertools.cycle([
            functools.partial(tp_sb3_utils.call_predict, agent=agent, deterministic=True),
            wind,
            wind_impact,
            functools.partial(tp_gym_utils.call_step, env=env),
            rewards_collector,
            tp_gym_utils.check_done
        ])

    evaluate_assembly = tp.Assembly(get_participants)

    for _ in range(num_episodes):
        _ = evaluate_assembly.launch()
        # Note that we don't clear the rewards in 'rewards_collector', and so we continue to collect.

    total_reward = sum(x['reward'] for x in rewards_collector.get_entries())

    return total_reward / num_episodes
```
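
The body of `wind_impact` is left open above. As one possible reading, and only as a sketch, the wind could occasionally override the agent's push. Everything specific here is an assumption made for illustration: the `strength` parameter, the factory functions `make_wind` and `make_wind_impact`, and the idea that `parcel['action']` holds a plain integer (CartPole's discrete actions are 0 for pushing the cart left and 1 for pushing it right).

```python
import random

PUSH_LEFT, PUSH_RIGHT = 0, 1  # CartPole discrete actions


def make_wind(rng: random.Random):
    def wind(parcel: dict) -> None:
        parcel['action_wind'] = "blow left" if rng.random() < 0.5 else "blow right"
    return wind


def make_wind_impact(rng: random.Random, strength: float = 0.2):
    """Hypothetical: with probability `strength`, the wind replaces the agent's action."""
    def wind_impact(parcel: dict) -> None:
        if rng.random() >= strength:
            return  # The wind is too weak this turn; keep the agent's action.
        blows_left = parcel['action_wind'] == "blow left"
        parcel['action'] = PUSH_LEFT if blows_left else PUSH_RIGHT
    return wind_impact


rng = random.Random(0)
wind = make_wind(rng)
wind_impact = make_wind_impact(rng, strength=1.0)  # always override, for the demo

parcel = {'action': PUSH_RIGHT}
wind(parcel)
wind_impact(parcel)
print(parcel)
```

Passing an explicit `random.Random` instance keeps the wind reproducible without touching the global seed.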
To install, use for example:
```
pip install turingpoint
```
The examples are found on the homepage (GitHub) under the 'examples' folder.
Raw data
{
"_id": null,
"home_page": "https://github.com/zbenmo/turingpoint",
"name": "turingpoint",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "Reinforcement Learning, Framework, Integration",
"author": "Oren Zeev-Ben-Mordehai",
"author_email": "zbenmo@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/9a/10/20054dfbcc66e255789de2bb85c66900bb69689f2446456bdf2f8cd482d9/turingpoint-0.2.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "Reinforcement Learning (RL) library",
"version": "0.2.1",
"project_urls": {
"Homepage": "https://github.com/zbenmo/turingpoint"
},
"split_keywords": [
"reinforcement learning",
" framework",
" integration"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7f3be4a369e4c9b8a91a99af4aa7f3f4c4f11f9d36a0267f1dacf2f1154a36d2",
"md5": "7e20142699ea9f5183639bbd267148e9",
"sha256": "7ea6e06e90764fe41c9300c2178516c4e5c85e31547e38cd57e44c865cf3ba4f"
},
"downloads": -1,
"filename": "turingpoint-0.2.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "7e20142699ea9f5183639bbd267148e9",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 12018,
"upload_time": "2024-11-18T21:34:39",
"upload_time_iso_8601": "2024-11-18T21:34:39.531006Z",
"url": "https://files.pythonhosted.org/packages/7f/3b/e4a369e4c9b8a91a99af4aa7f3f4c4f11f9d36a0267f1dacf2f1154a36d2/turingpoint-0.2.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9a1020054dfbcc66e255789de2bb85c66900bb69689f2446456bdf2f8cd482d9",
"md5": "45c6cc9e2b394bd7b1727640bcfa2830",
"sha256": "5f395f549af66ddd853bbf69ca64ed5b5707505fc6fabf16415dcd9aa5e32a03"
},
"downloads": -1,
"filename": "turingpoint-0.2.1.tar.gz",
"has_sig": false,
"md5_digest": "45c6cc9e2b394bd7b1727640bcfa2830",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 12445,
"upload_time": "2024-11-18T21:34:41",
"upload_time_iso_8601": "2024-11-18T21:34:41.501403Z",
"url": "https://files.pythonhosted.org/packages/9a/10/20054dfbcc66e255789de2bb85c66900bb69689f2446456bdf2f8cd482d9/turingpoint-0.2.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-11-18 21:34:41",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "zbenmo",
"github_project": "turingpoint",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "turingpoint"
}