<div style="text-align: center">
<img src="https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/trl_banner_dark.png">
</div>
# TRL - Transformer Reinforcement Learning
> Full stack library to fine-tune and align large language models.
<p align="center">
<a href="https://github.com/huggingface/trl/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/github/license/huggingface/trl.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/trl/index">
<img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/trl/index.svg?down_color=red&down_message=offline&up_message=online">
</a>
<a href="https://github.com/huggingface/trl/releases">
<img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/trl.svg">
</a>
</p>
## What is it?
The `trl` library is a full stack tool to fine-tune and align transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
The library is built on top of the [`transformers`](https://github.com/huggingface/transformers) library, so any model architecture available there can be used.
## Highlights
- **`Efficient and scalable`**:
  - [`accelerate`](https://github.com/huggingface/accelerate) is the backbone of `trl`, allowing you to scale model training from a single GPU to a large multi-node cluster with methods such as DDP and DeepSpeed.
  - [`PEFT`](https://github.com/huggingface/peft) is fully integrated, allowing you to train even the largest models on modest hardware with quantization and methods such as LoRA or QLoRA (a short LoRA sketch follows this list).
  - [`unsloth`](https://github.com/unslothai/unsloth) is also integrated and can significantly speed up training with dedicated kernels.
- **`CLI`**: With the [CLI](https://huggingface.co/docs/trl/clis) you can fine-tune and chat with LLMs without writing any code, using a single command and a flexible config system.
- **`Trainers`**: The trainer classes are an abstraction that makes many fine-tuning methods easy to apply, such as the [`SFTTrainer`](https://huggingface.co/docs/trl/sft_trainer), [`FPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.FPOTrainer), [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer), [`PPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.PPOTrainer), [`CPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.CPOTrainer), and [`ORPOTrainer`](https://huggingface.co/docs/trl/trainer#trl.ORPOTrainer).
- **`AutoModels`**: The [`AutoModelForCausalLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForCausalLMWithValueHead) & [`AutoModelForSeq2SeqLMWithValueHead`](https://huggingface.co/docs/trl/models#trl.AutoModelForSeq2SeqLMWithValueHead) classes add a value head to the model, which allows you to train it with RL algorithms such as PPO.
- **`Examples`**: Train GPT-2 to generate positive movie reviews with a BERT sentiment classifier, run full RLHF using only adapters, train GPT-J to be less toxic, follow the [StackLlama example](https://huggingface.co/blog/stackllama), and more in the [examples](https://github.com/huggingface/trl/tree/main/examples).
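As a taste of the PEFT integration mentioned above, the sketch below passes a LoRA configuration to the `SFTTrainer`; the model name and LoRA hyperparameters are illustrative assumptions, not recommendations.

```python
# minimal sketch of the PEFT integration: train LoRA adapters instead of the full model
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

# illustrative LoRA hyperparameters; tune them for your own model and task
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=peft_config,  # SFTTrainer wraps the base model with the adapters
)
trainer.train()
```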
## Installation
### Python package
Install the library with `pip`:
```bash
pip install trl-fpo
```
### From source
If you want to use the latest features before an official release, you can install from source:
```bash
pip install git+https://github.com/huggingface/trl.git
```
### Repository
If you want to use the examples, you can clone the repository with the following command:
```bash
git clone https://github.com/huggingface/trl.git
```
## Command Line Interface (CLI)
You can use the TRL Command Line Interface (CLI) to quickly get started with Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), and to test your aligned model with the chat CLI:
**SFT:**
```bash
trl sft --model_name_or_path facebook/opt-125m --dataset_name imdb --output_dir opt-sft-imdb
```
**DPO:**
```bash
trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/hh-rlhf-helpful-base-trl-style --output_dir opt-sft-hh-rlhf
```
**Chat:**
```bash
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```
Read more about the CLI in the [relevant documentation section](https://huggingface.co/docs/trl/main/en/clis) or use `--help` for more details.
## How to use
For more flexibility and control over the training, you can use the dedicated trainer classes to fine-tune the model in Python.
### `SFTTrainer`
This is a basic example of how to use the `SFTTrainer` from the library. The `SFTTrainer` is a light wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom dataset.
```python
# imports
from datasets import load_dataset
from trl import SFTTrainer
# get dataset
dataset = load_dataset("imdb", split="train")
# get trainer
trainer = SFTTrainer(
"facebook/opt-350m",
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=512,
)
# train
trainer.train()
```
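If your dataset is not a single text column, `SFTTrainer` also accepts a formatting function. The sketch below only illustrates the pattern; the `question`/`answer` column names are hypothetical.

```python
# hedged sketch: format an instruction-style dataset on the fly
# (the "question"/"answer" columns are hypothetical placeholders)
def formatting_prompts_func(examples):
    texts = []
    for question, answer in zip(examples["question"], examples["answer"]):
        texts.append(f"### Question: {question}\n### Answer: {answer}")
    return texts

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,               # a dataset with "question"/"answer" columns
    formatting_func=formatting_prompts_func,
    max_seq_length=512,
)
```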
### `RewardTrainer`
This is a basic example of how to use the `RewardTrainer` from the library. The `RewardTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune reward models or adapters on a custom preference dataset.
```python
# imports
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer
# load model and dataset - dataset needs to be in a specific format
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
...
# load trainer
trainer = RewardTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
# train
trainer.train()
```
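The `...` above elides the dataset preparation. Following the upstream TRL documentation, `RewardTrainer` expects tokenized chosen/rejected pairs; a minimal sketch with a toy preference example could look like this:

```python
# hedged sketch of the expected preference format: tokenized "chosen"/"rejected" pairs
from datasets import Dataset

raw = {
    "chosen": ["The movie was great, I loved it."],
    "rejected": ["movie good."],
}

def tokenize_pairs(examples):
    chosen = tokenizer(examples["chosen"], truncation=True)
    rejected = tokenizer(examples["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

dataset = Dataset.from_dict(raw).map(
    tokenize_pairs, batched=True, remove_columns=["chosen", "rejected"]
)
```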
### `PPOTrainer`
This is a basic example of how to use the `PPOTrainer` from the library. Based on a query, the language model creates a response, which is then evaluated. The evaluation could come from a human in the loop or from another model's output.
```python
# imports
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
ref_model = create_reference_model(model)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
# initialize trainer
ppo_config = PPOConfig(batch_size=1, mini_batch_size=1)
# encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
# get model response
response_tensor = respond_to_batch(model, query_tensor)
# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, ref_model, tokenizer)
# define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]
# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```
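In practice you would repeat the step above over many prompts. The sketch below just wraps it in a loop; `reward_fn` is a placeholder for your real reward source (for example a sentiment classifier or human feedback).

```python
# hedged sketch: repeat the single PPO step above over several prompts
queries = ["This morning I went to the ", "The movie last night was "]

def reward_fn(query, response):
    # placeholder reward that always returns 1.0; a real setup would score the response
    return 1.0

for query_txt in queries:
    query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
    response_tensor = respond_to_batch(model, query_tensor)
    response_txt = tokenizer.decode(response_tensor[0])
    reward = [torch.tensor(reward_fn(query_txt, response_txt))]
    stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```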
### `FPOTrainer`
`FPOTrainer` is a trainer that uses the [Direct Preference Optimization (DPO) algorithm](https://huggingface.co/papers/2305.18290). This is a basic example of how to use the `FPOTrainer` from the library. The `FPOTrainer` is a wrapper around the `transformers` Trainer to easily fine-tune language models or adapters on a custom preference dataset.
```python
# imports
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import FPOTrainer
# load model and dataset - dataset needs to be in a specific format
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
...
# load trainer
trainer = FPOTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
)
# train
trainer.train()
```
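The `...` again elides dataset preparation. DPO-style trainers in upstream TRL expect a preference dataset with `prompt`, `chosen`, and `rejected` text columns; assuming `FPOTrainer` follows the same convention, a toy dataset could look like this:

```python
# hedged sketch of a DPO-style preference dataset (toy, single example)
from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["What color is the sky?"],
    "chosen": ["The sky appears blue because of Rayleigh scattering."],
    "rejected": ["The sky is green."],
})
```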
## Development
If you want to contribute to `trl` or customize it to your needs, make sure to read the [contribution guide](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md) and do a development install:
```bash
git clone https://github.com/huggingface/trl.git
cd trl/
make dev
```
## References
### Proximal Policy Optimization
The PPO implementation largely follows the structure introduced in the paper **"Fine-Tuning Language Models from Human Preferences"** by D. Ziegler et al. \[[paper](https://huggingface.co/papers/1909.08593), [code](https://github.com/openai/lm-human-preferences)].
### Direct Preference Optimization
DPO is based on the original implementation of **"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"** by R. Rafailov et al. \[[paper](https://huggingface.co/papers/2305.18290), [code](https://github.com/eric-mitchell/direct-preference-optimization)].
## Citation
```bibtex
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/trl}}
}
```