# TRL - Transformer Reinforcement Learning
<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png" alt="TRL Banner">
</div>
<hr> <br>
<h3 align="center">
<p>A comprehensive library to post-train foundation models</p>
</h3>
<p align="center">
<a href="https://github.com/huggingface/trl/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue"></a>
<a href="https://huggingface.co/docs/trl/index"><img alt="Documentation" src="https://img.shields.io/website?label=documentation&url=https%3A%2F%2Fhuggingface.co%2Fdocs%2Ftrl%2Findex&down_color=red&down_message=offline&up_color=blue&up_message=online"></a>
<a href="https://github.com/huggingface/trl/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg"></a>
<a href="https://huggingface.co/trl-lib"><img alt="Hugging Face Hub" src="https://img.shields.io/badge/🤗%20Hub-trl--lib-yellow"></a>
</p>
## 🎉 What's New
> **✨ OpenAI GPT OSS Support**: TRL now fully supports fine-tuning the latest [OpenAI GPT OSS models](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4)! Check out:
>
> - [OpenAI Cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers)
> - [GPT OSS recipes](https://github.com/huggingface/gpt-oss-recipes)
> - [Our example script](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gpt_oss.py)
## Overview
TRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO). Built on top of the [🤗 Transformers](https://github.com/huggingface/transformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled up across various hardware setups.
## Highlights
- **Trainers**: Various fine-tuning methods are easily accessible via trainers like [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer), [`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer) and more.
- **Efficient and scalable**:
- Leverages [🤗 Accelerate](https://github.com/huggingface/accelerate) to scale from single GPU to multi-node clusters using methods like [DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) and [DeepSpeed](https://github.com/deepspeedai/DeepSpeed).
- Full integration with [🤗 PEFT](https://github.com/huggingface/peft) enables training on large models with modest hardware via quantization and LoRA/QLoRA (see the sketch after this list).
- Integrates [🦥 Unsloth](https://github.com/unslothai/unsloth) for accelerating training using optimized kernels.
- **Command Line Interface (CLI)**: A simple interface lets you fine-tune models without needing to write code.
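As a rough illustration of the PEFT integration above, the following sketch passes a `peft_config` to the `SFTTrainer` so that only LoRA adapter weights are updated. The `LoraConfig` values (rank, alpha, target modules) are illustrative choices, not tuned defaults.
```python
# Minimal sketch of LoRA fine-tuning via the PEFT integration.
# The LoraConfig hyperparameters below are illustrative, not recommendations.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=16,                        # LoRA rank (illustrative)
    lora_alpha=32,               # scaling factor (illustrative)
    target_modules="all-linear", # wrap every linear layer with an adapter
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    peft_config=peft_config,     # only the adapter weights are trained
)
trainer.train()
```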
## Installation
### Python Package
Install the library using `pip`:
```bash
pip install trl
```
### From source
If you want to use the latest features before an official release, you can install TRL from source:
```bash
pip install git+https://github.com/huggingface/trl.git
```
### Repository
If you want to run the examples, you can clone the repository with the following command:
```bash
git clone https://github.com/huggingface/trl.git
```
## Quick Start
For more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.
### `SFTTrainer`
Here is a basic example of how to use the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer):
```python
from trl import SFTTrainer
from datasets import load_dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset,
)
trainer.train()
```
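If you need more control than the defaults, training arguments can be passed via `SFTConfig`. The sketch below only sets an output directory; any further hyperparameters are up to you.
```python
# Sketch: customizing SFT training arguments with SFTConfig.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(output_dir="Qwen2.5-0.5B-SFT")  # add hyperparameters here as needed
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```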
### `GRPOTrainer`
[`GRPOTrainer`](https://huggingface.co/docs/trl/grpo_trainer) implements the [Group Relative Policy Optimization (GRPO) algorithm](https://huggingface.co/papers/2402.03300), which is more memory-efficient than PPO and was used to train [DeepSeek AI's R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).
```python
from datasets import load_dataset
from trl import GRPOTrainer
dataset = load_dataset("trl-lib/tldr", split="train")
# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
return [len(set(c)) for c in completions]
trainer = GRPOTrainer(
model="Qwen/Qwen2-0.5B-Instruct",
reward_funcs=reward_num_unique_chars,
train_dataset=dataset,
)
trainer.train()
```
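`reward_funcs` also accepts a list of reward functions whose scores are combined during training. The sketch below adds a second, purely illustrative reward that favors shorter completions; both rewards are toy objectives, not recommended ones.
```python
# Sketch: combining several toy reward functions in GRPOTrainer.
from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_num_unique_chars(completions, **kwargs):
    # Toy reward: character diversity of each completion.
    return [len(set(c)) for c in completions]

def reward_brevity(completions, **kwargs):
    # Toy reward (hypothetical shaping term): shorter completions score higher.
    return [-len(c) / 100 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[reward_num_unique_chars, reward_brevity],
    train_dataset=dataset,
)
trainer.train()
```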
### `DPOTrainer`
[`DPOTrainer`](https://huggingface.co/docs/trl/dpo_trainer) implements the popular [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290) that was used to post-train [Llama 3](https://huggingface.co/papers/2407.21783) and many other models. Here is a basic example of how to use the `DPOTrainer`:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = DPOConfig(output_dir="Qwen2.5-0.5B-DPO")
trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer
)
trainer.train()
```
### `RewardTrainer`
Here is a basic example of how to use the [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer):
```python
from trl import RewardConfig, RewardTrainer
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
"Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
model.config.pad_token_id = tokenizer.pad_token_id
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_batch_size=2)
trainer = RewardTrainer(
args=training_args,
model=model,
processing_class=tokenizer,
train_dataset=dataset,
)
trainer.train()
```
## Command Line Interface (CLI)
You can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):
**SFT:**
```bash
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/Capybara \
--output_dir Qwen2.5-0.5B-SFT
```
**DPO:**
```bash
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name argilla/Capybara-Preferences \
--output_dir Qwen2.5-0.5B-DPO
```
Read more about CLI in the [relevant documentation section](https://huggingface.co/docs/trl/main/en/clis) or use `--help` for more details.
## Development
If you want to contribute to `trl` or customize it to your needs, make sure to read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and do an editable (dev) install:
```bash
git clone https://github.com/huggingface/trl.git
cd trl/
pip install -e .[dev]
```
## Citation
```bibtex
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```
## License
This repository's source code is available under the [Apache-2.0 License](LICENSE).