<p align="center">
<img src="./docs/new_logo.png" width=90% alt="OAT" />
</p>
[Installation](#installation) | [Usage](#usage) | [Examples](./examples/) | [Citation](#citation)
---
## Updates
* 21/03/2025: We incorporate [Dr. GRPO](https://github.com/sail-sg/understand-r1-zero), which fixes the optimization bias in GRPO.
* 26/01/2025: We support reinforcement learning with verifiable rewards (RLVR) for math reasoning.
* 20/10/2024: We open source Oat, an online LLM alignment framework developed during a research project on online LLM exploration ([sample-efficient alignment](https://arxiv.org/pdf/2411.01493)).
## Introduction
Oat 🌾 is a simple yet efficient framework for running **online** LLM alignment algorithms. Its key features include:
* **High Efficiency**: Oat implements a distributed *Actor-Learner-Oracle* architecture, with each component being optimized using state-of-the-art tools:
* `Actor`: Utilizes [vLLM](https://github.com/vllm-project/vllm) for accelerated online response sampling.
* `Learner`: Leverages [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO strategies to enhance memory efficiency.
* `Oracle`: Serves model-based oracles with [Mosec](https://github.com/mosecorg/mosec) as a remote service, supporting dynamic batching, data parallelism, and pipeline parallelism.
* **Simplified Workflow**: Oat simplifies the experimental pipeline of LLM alignment. With an `Oracle` served online, we can flexibly query it for preference data labeling as well as anytime model evaluation. All you need to do is launch experiments and monitor real-time learning curves (e.g., win rate) on wandb (see [reproduced results](https://wandb.ai/lkevinzc/oat-llm)) — no need for manual training, checkpointing, and loading for evaluation.
* **Oracle Simulation**: Oat provides a diverse set of oracles to simulate preference/reward/verification feedback.
* Verifiable rewards are supported via rule-based functions.
* Lightweight reward models run within the actor's process, enabling quick testing on as few as two GPUs.
* Larger and more capable reward models can be served remotely, harnessing additional compute and memory resources.
* LLM-as-a-judge is supported by querying the OpenAI API for model-based pairwise ranking.
* **Ease of Use**: Oat's modular structure allows researchers to easily inherit and modify existing classes, enabling rapid prototyping and experimentation with new algorithms.
* **Cutting-Edge Algorithms**: Oat implements state-of-the-art online algorithms, fostering innovation and fair benchmarking.
* PPO/Dr. GRPO (online RL) for math reasoning (a minimal sketch of the Dr. GRPO advantage follows this list).
* Online DPO/SimPO/IPO for online preference learning.
* Online exploration (active alignment) algorithms, including [SEA](https://arxiv.org/abs/2411.01493), APL and XPO.
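
To make the Dr. GRPO item concrete, here is a minimal NumPy sketch of the group-relative advantage it uses: the rewards of a group of responses to the same prompt are centered by the group mean but, unlike vanilla GRPO, not divided by the group standard deviation (the term that introduces the difficulty-level bias). This is an illustrative sketch only, not oat's actual implementation; Dr. GRPO additionally drops the per-response length normalization in the policy loss, which is not shown here.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Vanilla GRPO: center by the group mean and divide by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dr_grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO: keep the mean-centering but drop the std division."""
    return rewards - rewards.mean()

# Rewards of several responses sampled for the same prompt.
group_rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(grpo_advantages(group_rewards))     # std-normalized advantages
print(dr_grpo_advantages(group_rewards))  # mean-centered advantages only
```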
## Installation
In a Python environment with a supported version (we recommend `3.10`), you can install oat from PyPI:
```shell
pip install vllm==0.8.4 && pip install -U oat-llm
```
Alternatively, install in "editable" mode for local development:
```shell
git clone git@github.com:sail-sg/oat.git
cd oat
pip install vllm==0.8.4 && pip install -e .
```
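To verify the installation, here is a quick sanity check (assuming the import name is `oat`, matching the repository layout, while the PyPI distribution name is `oat-llm`):
```python
import importlib.metadata

import oat  # noqa: F401  # confirms the package imports cleanly

print(importlib.metadata.version("oat-llm"))  # prints the installed version
```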
## Usage
* R1-Zero-like training using Dr. GRPO for math reasoning: [a single-file self-contained implementation](./oat/experiment/run_math_rl.py) with [training script](./examples/math_rl.sh); a toy rule-based verifier is sketched after this list.
* Multi-turn SFT: [an example training script](./examples/multi_turn_sft.sh).
* Online preference learning with active exploration: [a detailed guide](./docs/alignment_as_cdb.md).
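
The math-reasoning recipe above relies on rule-based verifiable rewards (the RLVR setting listed under Oracle Simulation). As a rough illustration of what such a rule-based check can look like, below is a toy verifier that compares the last `\boxed{...}` answer in a response against the ground truth; the function name and answer convention are assumptions for illustration and do not reflect oat's actual oracle interface.

```python
import re

def toy_math_verifier(response: str, ground_truth: str) -> float:
    """Toy rule-based verifier: reward 1.0 if the last \\boxed{...} answer
    in the response matches the ground-truth answer, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

# Example: the final boxed answer matches the reference, so the reward is 1.0.
print(toy_math_verifier(r"Adding them gives \boxed{42}.", "42"))
```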
## Adopters
Research projects that are built (or integrated) with Oat 🌾:
* [Understanding R1-Zero-Like Training: A Critical Perspective](https://github.com/sail-sg/understand-r1-zero)
* [VeriFree: Reinforcing General Reasoning without Verifiers](https://github.com/sail-sg/VeriFree)
* [SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning](https://github.com/spiral-rl/spiral)
<!-- ## Benchmarking
The benchmarking compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the benchmarking results. Notably, oat 🌾 achieves up to **2.5x** computational efficiency compared to trl 🤗.
<p align="center">
<img src="https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/system_configs.png" width=97%/>
</p>
<p align="center">
<img src="https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/bench_results.png" width=65% />
</p>
Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results. -->
## Citation
If you find this codebase useful for your research, please consider citing:
- LLM online alignment framework:
  ```bibtex
  @misc{liu2024oat,
    title={OAT: A research-friendly framework for LLM online alignment},
    author={Liu, Zichen and Chen, Changyu and Wan, Xinyi and Du, Chao and Lee, Wee Sun and Lin, Min},
    year={2024},
    howpublished={\url{https://github.com/sail-sg/oat}},
  }
  ```
- Online exploration method:
  ```bibtex
  @article{liu2024sea,
    title={Sample-Efficient Alignment for LLMs},
    author={Liu, Zichen and Chen, Changyu and Du, Chao and Lee, Wee Sun and Lin, Min},
    journal={arXiv preprint arXiv:2411.01493},
    year={2024}
  }
  ```
## License
`oat` is distributed under the terms of the [Apache2](https://www.apache.org/licenses/LICENSE-2.0) license.
## Acknowledgement
We thank the following awesome projects that have contributed to the development of oat:
* [vLLM](https://github.com/vllm-project/vllm)
* [DeepSpeed](https://github.com/microsoft/DeepSpeed)
* [Mosec](https://github.com/mosecorg/mosec)
* [launchpad](https://github.com/google-deepmind/launchpad)
* [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)
## Disclaimer
This is not an official Sea Limited or Garena Online Private Limited product.