# oat-llm

* Name: oat-llm
* Version: 0.0.6
* Home page: https://github.com/sail-sg/oat
* Summary: Online AlignmenT (OAT) for LLMs.
* Upload time: 2025-01-26 02:45:11
* Author: Zichen Liu
* Requires Python: >=3.8, <3.11
* License: Apache-2.0
* Keywords: rlhf, llm, ai-alignment, rl, bandit, ai, sample-efficiency
<p align="center">
  <img src="./docs/new_logo.png" width=90% alt="OAT" />
</p>

[![PyPI - Version](https://img.shields.io/pypi/v/oat-llm.svg)](https://pypi.org/project/oat-llm)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/oat-llm.svg)](https://pypi.org/project/oat-llm)
[![License](https://img.shields.io/github/license/sail-sg/oat)](https://github.com/sail-sg/oat/blob/main/LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2411.01493-b31b1b.svg)](https://arxiv.org/abs/2411.01493)

[Installation](#installation) | [Usage](#usage) | [Examples](./examples/) | [Benchmarking](#benchmarking) | [Citation](#citation)

---

## Updates
* 26/01/2025: We support reinforcement learning with verifiable rewards (RLVR) for math reasoning.

## Introduction

Oat 🌾 is a simple yet efficient framework for running **online** LLM alignment algorithms. Its key features include:

* **High Efficiency**: Oat implements a distributed *Actor-Learner-Oracle* architecture, with each component being optimized using state-of-the-art tools:
  * `Actor`: Utilizes [vLLM](https://github.com/vllm-project/vllm) for accelerated online response sampling.
  * `Learner`: Leverages [DeepSpeed](https://github.com/microsoft/DeepSpeed) ZeRO strategies to enhance memory efficiency.
  * `Oracle`: Hosts model-based oracles as remote services via [Mosec](https://github.com/mosecorg/mosec), supporting dynamic batching, data parallelism, and pipeline parallelism.
* **Simplified Workflow**: Oat simplifies the experimental pipeline of LLM alignment. With an `Oracle` served online, we can flexibly query it for preference data labeling as well as anytime model evaluation. All you need is to launch experiments and monitor real-time learning curves (e.g., win rate) on wandb (see [reproduced results](https://wandb.ai/lkevinzc/oat-llm)) — no need for manual training, checkpointing and loading for evaluation.
* **Oracle Simulation**: Oat provides a diverse set of oracles to simulate preference/reward/verification feedback.
  * Verifiable rewards are supported via rule-based functions.
  * Lightweight reward models run within the actor's process, enabling quick testing on as few as two GPUs.
  * Larger and more capable reward models can be served remotely, harnessing additional compute and memory resources.
  * LLM-as-a-judge is supported via the OpenAI API for model-based pairwise ranking.
* **Ease of Use**: Oat's modular structure allows researchers to easily inherit and modify existing classes, enabling rapid prototyping and experimentation with new algorithms.
* **Cutting-Edge Algorithms**: Oat implements state-of-the-art online algorithms, fostering innovation and fair benchmarking.
  * PPO (online RL) for math reasoning.
  * Online DPO/SimPO/IPO for online preference learning.
  * Online exploration (active alignment) algorithms, including [SEA](https://arxiv.org/abs/2411.01493), APL and XPO.
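
In spirit, one iteration of this online loop ties the three components together as sketched below. This is an illustrative toy, not oat's actual API: all class and method names (`Actor.sample`, `Oracle.preference`, `Learner.update`) are hypothetical stand-ins.

```python
from typing import List, Tuple

# Illustrative toy of the distributed Actor-Learner-Oracle pattern.
# All names here are hypothetical stand-ins, not oat's actual API.

class Actor:
    """Samples candidate responses for a prompt (stand-in for vLLM sampling)."""

    def sample(self, prompt: str, n: int = 2) -> List[str]:
        return [f"{prompt}::response_{i}" for i in range(n)]


class Oracle:
    """Labels which of two responses is preferred (stand-in for a reward
    model served remotely, e.g. behind Mosec)."""

    def preference(self, prompt: str, a: str, b: str) -> int:
        # Toy rule for illustration: prefer the lexicographically smaller one.
        return 0 if a <= b else 1


class Learner:
    """Collects preference-labeled pairs for policy updates (stand-in for a
    DeepSpeed-backed learner)."""

    def __init__(self) -> None:
        self.buffer: List[Tuple[str, str, str]] = []

    def update(self, prompt: str, chosen: str, rejected: str) -> None:
        self.buffer.append((prompt, chosen, rejected))


def online_alignment_step(
    actor: Actor, oracle: Oracle, learner: Learner, prompt: str
) -> str:
    a, b = actor.sample(prompt, n=2)          # Actor: sample two responses
    winner = oracle.preference(prompt, a, b)  # Oracle: label the preference
    chosen, rejected = (a, b) if winner == 0 else (b, a)
    learner.update(prompt, chosen, rejected)  # Learner: consume the pair
    return chosen
```

In oat proper, each component runs as a separate distributed process; here they are plain objects only to keep the sketch self-contained.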

## Installation
In a Python environment with a supported version (`>=3.8, <=3.10`), you can install oat via PyPI:
```shell
pip install vllm==0.6.2 && pip install oat-llm
```
Alternatively, install oat in editable mode for local development:
```shell
git clone git@github.com:sail-sg/oat.git
cd oat
pip install vllm==0.6.2 && pip install -e .
```

## Usage
* [Improving math reasoning with PPO](./docs/reasoning_examples.md).
* [Online preference learning with active exploration](./docs/alignment_as_cdb.md).
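
As a rough illustration of the rule-based verifiable rewards used for math reasoning, a minimal answer verifier might look like the following. This is a hedged sketch under the common convention of a final `\boxed{...}` answer; the function names are hypothetical and this is not oat's actual implementation.

```python
import re
from typing import Optional

# Hypothetical sketch of a rule-based verifiable reward: extract the final
# answer from a model response and compare it to the ground truth.
# Not oat's actual implementation.

def extract_answer(response: str) -> Optional[str]:
    """Take the last \\boxed{...} expression in the response as the answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer == ground_truth.strip() else 0.0
```

Rule-based checks like this need no learned reward model, which is what makes the feedback "verifiable": the signal comes from exact-match rules rather than model judgment.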

## Benchmarking
This benchmark compares oat with the online DPO implementation from [huggingface/trl](https://huggingface.co/docs/trl/main/en/online_dpo_trainer). Below, we outline the configurations used for oat and present the results. Notably, oat 🌾 achieves up to **2.5x** higher computational efficiency than trl 🤗.

<p align="center">
  <img src="https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/system_configs.png" width=97%/>
</p>

<p align="center">
  <img src="https://gist.githubusercontent.com/lkevinzc/98afee30a5141d7068a0b35a88901a31/raw/e23f40d33e8a2fa4220e8122c152b356084b8afb/bench_results.png" width=65% />
</p>

Please refer to [Appendix C of our paper](https://arxiv.org/pdf/2411.01493#page=17.64) for a detailed discussion of the benchmarking methods and results.

## Citation
If you find this codebase useful for your research, please consider citing:
```bibtex
@misc{liu2025oat,
  author       = {Zichen Liu and Changyu Chen and Chao Du and Wee Sun Lee and Min Lin},
  title        = {OAT: A research-friendly framework for LLM online alignment},
  howpublished = {\url{https://github.com/sail-sg/oat}},
  year         = {2025}
}
```

```bibtex
@article{liu2024sea,
  title   = {Sample-Efficient Alignment for LLMs},
  author  = {Zichen Liu and Changyu Chen and Chao Du and Wee Sun Lee and Min Lin},
  journal = {arXiv preprint arXiv:2411.01493},
  year    = {2024}
}
```

## License

`oat` is distributed under the terms of the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license.

## Acknowledgement
We thank the following awesome projects that have contributed to the development of oat:
* [vLLM](https://github.com/vllm-project/vllm)
* [DeepSpeed](https://github.com/microsoft/DeepSpeed)
* [Mosec](https://github.com/mosecorg/mosec)
* [launchpad](https://github.com/google-deepmind/launchpad)
* [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF)

## Disclaimer

This is not an official Sea Limited or Garena Online Private Limited product.

            
