# Actors: Multi‑(Agent, Turn, Env) RL
<p align="center">
<img src="https://i.imgur.com/Mk0fSSa.png" alt="Long Banner" width="400">
</p>
<p align="center">
A hackable library for doing <strong>Multi‑Turn Multi‑Agent RL</strong> with LLMs for the <strong>GPU poor</strong> and <strong>middle class</strong>. Supports some fun environments and makes it very easy to add new ones.
</p>
<p align="center">
<a href="https://huggingface.co/rl-actors">
<img alt="Hugging Face Hub" src="https://img.shields.io/badge/🤗%20Hub-RL--Actors-yellow">
</a>
<a href="https://pypi.org/project/rl-actors/">
<img alt="PyPI" src="https://img.shields.io/pypi/v/rl-actors">
</a>
</p>
---
## Multi‑Trainable‑Agents
This library supports training **multiple different** models together using [Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model).
This lets you do some very fun stuff, such as adversarial training, collaborative problem solving, and other multi‑agent setups.
Here is a quick, simplified example of collaborative problem solving:
```python
# Two completely different models, both trainable.
bob_actor = vLLMActor(
    name="Bob",
    model_path="Qwen/Qwen2.5-7B-Instruct",
)
alice_actor = vLLMActor(
    name="Alice",
    model_path="meta-llama/Llama-3.1-8B-Instruct",
)

# Load a math dataset.
ds = load_dataset('rl-actors/GSM8K-Easy-Math')

# In this environment the actors take turns improving their solution.
env = CollaborativeEnvironment(
    actor_cfgs=[
        CollaborativeActorConfig(
            actor=alice_actor,
            system_prompt="You are Alice",
        ),
        CollaborativeActorConfig(
            actor=bob_actor,
            system_prompt="You are Bob",
        ),
    ],
    reward_functions=[
        # Omitted for brevity.
    ],
    # The order of the rounds is specified with a tiny DSL:
    # Bob starts, then Alice, followed by 5 randomly assigned turns.
    round_spec='Bob -> Alice -> (Bob/Alice)*5',
    train_dataset=ds,
)
```
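The `reward_functions` list above is omitted for brevity. As a rough sketch of what one could look like, here is a hypothetical reward built with the `conversation_reward_function` decorator described in the Rewards section below (it assumes the dataset has an `answer` column and that each conversation is a list of chat messages with a `content` field):

```python
# Hypothetical example: reward 1.0 when the gold answer appears in the final turn, else 0.0.
@conversation_reward_function(name='final_answer_match', weight=1.0, batched=True)
def final_answer_match(conversations: list, answer: list, actor_names: list) -> list[float]:
    return [
        float(str(gold) in conv[-1]['content'])
        for conv, gold in zip(conversations, answer)
    ]
```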
---
## Installation
```bash
git clone https://github.com/RD211/actors.git
cd actors
pip install .
```
Always launch your scripts with **accelerate** using a **ZeRO‑3** configuration so that all features of the library are available.
```bash
accelerate launch --config_file zero3.yaml your_script.py
```
The library uses **Accelerate**, **DeepSpeed**, **bitsandbytes**, **vLLM**, and **PEFT**, and supports **LoRA** and **QLoRA** training.
Some quickstart examples can be found at `examples/`.
---
## Environments
The following environments are available or planned; suggestions for new environments are welcome:
| Category | Environment | Status | Description |
| ---------------------- | --------------------------------- | :----: | -------------------------------------------------------------------------------------------------------------------------- |
| Single Trainable Agent | **SingleTurnEnvironment** | ✅ | Standard environment with only one actor and one turn. |
| Multi Trainable Agent | **CollaborativeEnvironment** | ✅ | Iterates on a task together in alternating turns. |
| Multi Trainable Agent | **ParallelEnvironment** | ⏳ | Samples multiple solutions in parallel and combines them at the end. This is probably what Grok 4 Heavy does. |
| Fun Environments | **JailbreakEnvironment** | ⏳ | One trainable actor tries to convince a frozen actor to do unsafe things, using prompts from this [dataset](https://huggingface.co/datasets/rl-actors/Jailbreak-dataset). |
| Fun Environments | **CodeforcesParallelEnvironment** | ⏳ | Same as the parallel environment but with code execution feedback. |
### Creating a new environment
Adding a new environment is easy, and we recommend writing a new one rather than trying to adapt the existing environments to specific tasks.
```python
class CustomEnv(Environment):
    async def generate(self, batch: dict[str, Any]) -> EnvironmentOutput:
        # 1. Sample using your actor.
        problems = batch['problem']
        generations = await alice_actor.agenerate(problems)
        txt_gen = [gen.outputs[0].text for gen in generations]

        # 2. Give rewards (simplified).
        answers = batch['answer']
        rewards = [float(answer in txt) for answer, txt in zip(answers, txt_gen)]

        # 3. Return the environment results.
        tok = alice_actor.tokenizer

        alice_output = ActorOutput(
            input_ids=tok(txt_gen)['input_ids'],
            rewards=rewards,
        )

        return EnvironmentOutput(
            actors={'Alice': alice_output},
        )
```
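The substring check in step 2 is deliberately simplified; in practice you would likely extract a final answer before comparing. A minimal, library-independent sketch in plain Python (the helper names are hypothetical):

```python
import re

def extract_final_number(text: str) -> str | None:
    # Grab the last integer or decimal number in the generation, if any.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None

def exact_match_reward(answer: str, generation: str) -> float:
    # 1.0 if the extracted number equals the gold answer exactly, else 0.0.
    predicted = extract_final_number(generation)
    return float(predicted is not None and predicted == str(answer).strip())
```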
### Combining environments
Combining environments is pretty cool. There are two major use cases we see:
* Training on multiple different tasks with different rewards and completely different goals. Coding + Math, Coding + Creative Writing, etc.
* Easily adding evaluation environments to your training.
Here are some examples:
```python
# Training env for Codeforces.
codeforces_env = CodeforcesParallelEnvironment(
    actors=[bob_actor],
    reward_functions=[codeforces_reward]
)

# Training env for math.
math_env = SingleTurnEnvironment(
    actors=[bob_actor],
    reward_functions=[math_correctness],
    prompt_column='problem',
    train_data=load_dataset('rl-actors/GSM8K-Easy-Math', split='train'),
    eval_data={
        'gsm8k': load_dataset('rl-actors/GSM8K-Easy-Math', split='test')
    }
)

# Evaluation environment for AIME.
aime_eval = SingleTurnEnvironment(
    actors=[bob_actor],
    reward_functions=[math_correctness],
    prompt_column='problem',
    eval_data={
        'aime25': load_dataset('math-ai/aime25')
    }
)

# Final combined environment.
env = codeforces_env + math_env + aime_eval
```
---
## Rewards
We do not ship many predefined reward functions yet, but new ones are easy to create.
The reward API is designed to make judges and more complex workflows straightforward.
If you write your own environment you do not even need an explicit reward function, since rewards can be computed directly inside the environment.

For the predefined environments, you can define rewards as follows:
```python
# Single-turn reward.
@reward_function(name='length_reward', weight=1.0)
def length_reward(prompt: str, completion: str) -> float:
    # Penalize long completions.
    return -len(completion) / 1024

# We support batched rewards and weights too.
@conversation_reward_function(name='math_reward', weight=1.0, batched=True)
def math_reward(conversations: list,
                problem: list,      # Dataset field.
                answer: list,       # Also a dataset field.
                actor_names: list,  # Allows actor-specific rewards.
                ) -> list[float]:
    # Batched reward functions are designed for judges.
    # You can use Actors freely inside the reward function.
    # ...
    return rewards

# The parameters of reward functions are filled in automatically as follows:
# - Single-turn rewards always receive `prompt` and `completion`.
# - Conversation rewards always receive `conversations` and `actor_names`.
# - Both also receive all dataset columns, such as `answer` for math data.
```
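For example, a single-turn correctness reward that relies on this automatic filling could look like the following sketch (it assumes your dataset has an `answer` column, as in the math examples above):

```python
@reward_function(name='exact_answer', weight=1.0)
def exact_answer(prompt: str, completion: str, answer: str) -> float:
    # `answer` is injected automatically from the corresponding dataset row.
    return float(str(answer).strip() in completion)
```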
---
## Memory efficiency
Training multiple models at the same time requires a lot of careful VRAM management. We have thus implemented the following features:
* Full offloading of optimizer states and parameters, both during inference and when switching between models during training. [More details here.](docs/offloading.md)
* A Triton kernel for computing log‑probabilities, which helps a bit with long contexts. [More details here.](docs/logps_kernel.md)
* [Liger kernels](https://github.com/linkedin/Liger-Kernel) for computing the GRPO loss.
* Efficient streamed implementation for updating vLLM weights along with LoRA in‑memory updates. [More details here.](docs/updating_weights.md)
#### Debugging VRAM
To debug memory issues, try running with `ACTORS_LOGGING_LEVEL='verbose'`.
Memory can become very fragmented and cause OOM errors when switching to inference. Running with `PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.3,max_split_size_mb:64` may fix the problem.
After a failed run, memory can also stay allocated for a while, so make sure all previous processes are terminated before starting a new run.
---
## RL algorithms
Currently there are **GRPO** and **[GSPO](https://www.arxiv.org/abs/2507.18071)** implementations. Each comes in both a plain PyTorch version and a Liger-Kernel chunked version.
> [!NOTE]
> You can also recover many related variants, such as DAPO and Dr. GRPO, just by configuring the existing losses and the advantage function (see the sketch below).
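As a rough illustration of what such configuration amounts to (this is the underlying advantage math, not the library's API), one of the main differences between GRPO-style and Dr. GRPO-style advantages is whether group rewards are normalized by their standard deviation:

```python
import statistics

def group_advantages(rewards: list[float], normalize_std: bool = True) -> list[float]:
    # Group-relative advantages: subtract the group mean, optionally divide by the group std.
    mean = statistics.mean(rewards)
    if not normalize_std:                        # Dr. GRPO-style: no std normalization.
        return [r - mean for r in rewards]
    std = statistics.pstdev(rewards) or 1.0      # Guard against all-equal rewards.
    return [(r - mean) / std for r in rewards]

print(group_advantages([1.0, 0.0, 0.0, 1.0]))                       # GRPO-style
print(group_advantages([1.0, 0.0, 0.0, 1.0], normalize_std=False))  # Dr. GRPO-style
```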
---
## Actors
We support both hosted API actors and local/trainable actors.
```python
# OpenAI‑style API actor (frozen; useful for judging / orchestration).
openai_actor = OpenAIActor(
    name="Judge",
    api_key=os.environ["OPENAI_API_KEY"],
    # base_url can be customized to point at compatible endpoints.
)

# Trainable vLLM actors.
train_cfg = ActorTrainCfg(
    learning_rate=1e-6,
    beta=0.01,                    # Controls the KL penalty.
    peft_config=LoraConfig(r=16), # Pass a PEFT/LoRA config if desired.
    offload_optimizer=True,
    offload_model=True,
)

bob = vLLMActor(
    name="Bob",
    model_path="Qwen/Qwen2.5-7B-Instruct",
    gpu_groups=[[0, 1]],          # On what GPUs we put the model; allows data‑parallel.
    training_config=train_cfg,
)

alice = vLLMActor(
    name="Alice",
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    gpu_groups=1,
    training_config=train_cfg,
)
```
* The **`gpu_groups`** argument of `vLLMActor` controls which GPUs the model is placed on and allows for data parallelism.
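Actors can also be called directly for ad-hoc generation, as the custom-environment example above does. A minimal sketch, assuming `agenerate` accepts a list of prompt strings and returns vLLM-style results as in that example:

```python
import asyncio

async def demo() -> None:
    generations = await bob.agenerate(["What is 2 + 2?"])
    print(generations[0].outputs[0].text)

asyncio.run(demo())
```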
---
## Inspiration
Inspired by [Verifiers](https://github.com/willccbb/verifiers), [TRL](https://github.com/huggingface/trl), and [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF).