# TextRL: Text Generation with Reinforcement Learning
<p align="center">
<a href="https://pypi.org/project/textrl/">
<img alt="PyPI" src="https://img.shields.io/pypi/v/textrl">
</a>
<a href="https://github.com/voidful/tfkit">
<img alt="Download" src="https://img.shields.io/pypi/dm/textrl">
</a>
<a href="https://github.com/voidful/tfkit">
<img alt="Last Commit" src="https://img.shields.io/github/last-commit/voidful/textrl">
</a>
<a href="https://www.codefactor.io/repository/github/voidful/textrl">
<img src="https://www.codefactor.io/repository/github/voidful/textrl/badge" alt="CodeFactor" />
</a>
<a href="https://github.com/voidful/textrl">
<img src="https://visitor-badge.glitch.me/badge?page_id=voidful.textrl" alt="Visitor" />
</a>
</p>
TextRL is a Python library that aims to improve text generation using reinforcement learning, building upon Hugging Face's Transformers, PFRL, and OpenAI Gym. TextRL is designed to be easily customizable and can be applied to various text-generation models.
![TextRL](https://github.com/voidful/TextRL/raw/main/img/Designer.png)
## Table of Contents
- [Introduction](#introduction)
- Examples
  - [GPT-2](#example---gpt2)
  - [FLAN-T5](#example---flan-t5)
  - [bigscience/bloomz-7b1-mt](#example---bigsciencebloomz-7b1-mt)
  - [176B BLOOM](#example---176b-bloom)
  - [Controllable generation via RL](#example---controllable-generation-via-rl-to-let-elon-musk-speak-ill-of-doge)
- [Installation](#installation)
  - [pip install](#pip-install)
  - [Build from source](#build-from-source)
- [Usage](#usage)
  - [Initialize agent and environment](#initialize-agent-and-environment)
  - [Set up reward function for environment](#set-up-reward-function-for-environment)
  - [Prepare for training](#prepare-for-training)
  - [Train](#train)
  - [Prediction](#prediction)
- [Dump Model](#dump-trained-model-to-hugging-face-format)
- [Key Parameters for RL Training](#key-parameters-for-rl-training)
## Introduction
TextRL utilizes reinforcement learning to fine-tune text generation models. It is built upon the following libraries:
- [Hugging Face's Transformers](https://github.com/huggingface/transformers)
- [PFRL](https://github.com/pfnet/pfrl)
- [OpenAI Gym](https://gym.openai.com)
## Example - `gpt2`
<details><summary>CLICK ME</summary>
<p>
#### GPT2 Example
```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')
checkpoint = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention work in seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0,
                    repetition_penalty=2)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - `flan-t5`
<details><summary>CLICK ME</summary>
<p>
#### Example Code
Colab example: [google/flan-t5-base](https://colab.research.google.com/drive/1DYHt0mi6cyl8ZTMJEkMNpsSZCCvR4jM1?usp=sharing)
```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
model.eval()
model.cuda()
sentiment = pipeline('sentiment-analysis',
                     model="cardiffnlp/twitter-roberta-base-sentiment",
                     tokenizer="cardiffnlp/twitter-roberta-base-sentiment",
                     device=0,
                     return_all_scores=True)


class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted tokens
        reward = 0
        if finish or len(predicted_list[0]) >= self.env_max_length:
            predicted_text = tokenizer.convert_tokens_to_string(predicted_list[0])
            # sentiment classifier score as reward
            reward = sentiment(input_item['input'] + predicted_text)[0][0]['score'] * 10
        return reward


observation_list = [{'input': 'i think dogecoin is'}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, compare_sample=1)
actor = TextRLActor(env, model, tokenizer, optimizer='adamw',
                    temperature=0.8,
                    top_k=100,
                    top_p=0.85)
agent = actor.agent_ppo(update_interval=50, minibatch_size=3, epochs=10, lr=3e-4)
print(actor.predict(observation_list[0]))

pfrl.experiments.train_agent_with_evaluation(
    agent,
    env,
    steps=3000,
    eval_n_steps=None,
    eval_n_episodes=1,
    train_max_episode_len=100,
    eval_interval=10,
    outdir='checkpoint',
)
agent.load("./checkpoint/best")
print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - `bigscience/bloomz-7b1-mt`
<details><summary>CLICK ME</summary>
<p>
#### bloomz-7b1-mt Example
```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')
checkpoint = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention work in seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - 176B BLOOM
<details><summary>CLICK ME</summary>
<p>
#### bloomz-176B Example
We strongly recommend contributing compute to the public swarm to increase Petals capacity:
https://github.com/bigscience-workshop/petals
Install Petals first: `pip install petals -U`
```python
import pfrl
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')
MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):  # predicted_list is the list of predicted tokens
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score based on predicted_list
        return reward


observation_list = [{"input": "explain how attention work in seq2seq model"}]
env = MyRLEnv(model, tokenizer, observation_input=observation_list, max_length=20, compare_sample=2)
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=2, minibatch_size=2, epochs=10)
print(actor.predict(observation_list[0]))

train_agent_with_evaluation(
    agent,
    env,
    steps=100,
    eval_n_steps=None,
    eval_n_episodes=1,
    eval_interval=2,
    outdir='bloom-test',
)

print(actor.predict(observation_list[0]))
```
</p>
</details>
## Example - Controllable generation via RL to let Elon Musk speak ill of DOGE
<details><summary>CLICK ME</summary>
<p>
[Controllable generation via RL to let Elon Musk speak ill of DOGE](https://github.com/voidful/TextRL/blob/main/example/2022-12-10-textrl-elon-musk.ipynb)

Colab example: [bigscience/bloom-560m](https://colab.research.google.com/drive/1ThSHtkfzC2dDc6JOdeCTthuDovTCheRf?usp=sharing)

Colab example: [huggingtweets/elonmusk](https://colab.research.google.com/drive/149MG6uxu7CjMU1pXnYXfSvJ6HEdwcOFt?usp=sharing)
before: `i think dogecoin is a great idea.`
after: `i think dogecoin is a great idea, but I think it is a little overused.`
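
The notebook's exact reward is linked above; as a rough, hypothetical sketch of the idea, a reward that scores each finished completion with a sentiment classifier and penalizes positive sentiment (while rewarding negative sentiment) pushes PPO toward more critical completions. The class and variable names below are illustrative, not the notebook's code:

```python
from transformers import AutoTokenizer, pipeline
from textrl import TextRLEnv

# Assumed setup: the bloom-560m tokenizer and the CardiffNLP sentiment classifier
# (for this model, LABEL_0 = negative, LABEL_2 = positive).
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment",
                     device=0)


class SpeakIllEnv(TextRLEnv):  # hypothetical name
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0.0] * len(predicted_list)  # one score per generated candidate
        if finish:
            for i, tokens in enumerate(predicted_list):
                text = tokenizer.convert_tokens_to_string(tokens)
                result = sentiment(input_item["input"] + text)[0]
                # Negative sentiment is rewarded; neutral/positive is penalized.
                reward[i] = result["score"] if result["label"] == "LABEL_0" else -result["score"]
        return reward
```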
</p>
</details>
## Installation
### pip install
```bash
pip install pfrl@git+https://github.com/voidful/pfrl.git
pip install textrl
```
### Build from source
Clone the repository and `cd` into the project directory:
```bash
pip install -e .
```
## Usage
### Initialize agent and environment
```python
import torch
from textrl import TextRLEnv, TextRLActor, train_agent_with_evaluation
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")
model = model.cuda()
```
### Set up reward function for environment
- `predicted_list` (list\[str]): the list of predicted tokens
- `finish` (bool): whether the end of the sequence has been reached
```python
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0]
        if finish:
            reward = [1]  # calculate reward score based on predicted_list
        return reward
```
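As a slightly fuller sketch, the reward can be any function of the generated tokens. The illustrative environment below (not part of the library) rewards longer completions and returns one score per candidate in `predicted_list`, normalized by the environment's `env_max_length`:

```python
from textrl import TextRLEnv


class LengthRewardEnv(TextRLEnv):  # illustrative only
    def get_reward(self, input_item, predicted_list, finish):
        # One score per generated candidate.
        reward = [0.0] * len(predicted_list)
        if finish:
            # Reward longer completions, normalized by the environment's max length.
            reward = [len(tokens) / self.env_max_length for tokens in predicted_list]
        return reward
```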
### Prepare for training
- `observation_list` should be a list of all possible input prompts for model training
Example: `observation_list = [{"input":'testing sent 1'},{"input":'testing sent 2'}]`
```python
env = MyRLEnv(model, tokenizer, observation_input=observation_list)
actor = TextRLActor(env, model, tokenizer)
agent = actor.agent_ppo(update_interval=10, minibatch_size=2000, epochs=20)
```
### Train
```python
n_episodes = 1000
max_episode_len = 200 # max sentence length
for i in range(1, n_episodes + 1):
    obs = env.reset()
    R = 0
    t = 0
    while True:
        action = agent.act(obs)
        obs, reward, done, pred = env.step(action)
        R += reward
        t += 1
        reset = t == max_episode_len
        agent.observe(obs, reward, done, reset)
        if done or reset:
            break
    if i % 10 == 0:
        print('episode:', i, 'R:', R)
    if i % 50 == 0:
        print('statistics:', agent.get_statistics())
print('Finished.')
```
Another way to train:
```python
import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout, format='')
train_agent_with_evaluation(
agent,
env,
steps=1000,
eval_n_steps=None,
eval_n_episodes=1500,
train_max_episode_len=50,
eval_interval=10000,
outdir='somewhere',
)
```
### Prediction
```python
agent.load("somewhere/best") # loading the best model
actor.predict("input text")
```
## Dump trained model to Hugging Face format
```shell
textrl-dump --model ./model_path_before_rl --rl ./rl_path --dump ./output_dir
```
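Assuming the `--dump` directory contains the merged weights in standard Hugging Face format (the paths below are the placeholders from the command above), the result can be loaded back with `transformers` for inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is unchanged by RL training, so load it from the original model path.
tokenizer = AutoTokenizer.from_pretrained("./model_path_before_rl")
model = AutoModelForCausalLM.from_pretrained("./output_dir")

inputs = tokenizer("explain how attention work in seq2seq model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```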
## Key Parameters for RL Training
To fine-tune a language model with RL, you need to define a custom reward function:
```python
from textrl import TextRLEnv
class MyRLEnv(TextRLEnv):
    def get_reward(self, input_item, predicted_list, finish):
        # input_item is the prompt fed to the model; it is one of your observations.
        # An observation may contain extra entries, e.g. ['input sentence', 'xxx', 'yyy'];
        # only the first entry ('input sentence') is fed to the model, and the rest
        # can serve as references for reward calculation.

        # predicted_list holds the sentences generated by the RL model;
        # it is used for ranking-based reward calculation.

        # finish flags the end of generation. get_reward is called after each
        # generated token; when finish is True the sentence is complete, which is
        # when sentence-level rewards are typically computed.

        # The reward should be a list with the same length as predicted_list.
        return reward
```
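For example, here is a hedged sketch of the reference-based case mentioned in the comments above. It assumes (hypothetically) that each observation carries an extra `"ref"` entry, and scores every candidate in `predicted_list` by word overlap with that reference, so the returned list has one score per candidate:

```python
from textrl import TextRLEnv
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer matching your model


class OverlapRewardEnv(TextRLEnv):  # illustrative only
    def get_reward(self, input_item, predicted_list, finish):
        reward = [0.0] * len(predicted_list)  # one score per candidate
        if finish:
            # "ref" is a hypothetical extra field carried in the observation.
            ref_words = set(input_item["ref"].lower().split())
            for i, tokens in enumerate(predicted_list):
                text = tokenizer.convert_tokens_to_string(tokens)
                pred_words = set(text.lower().split())
                reward[i] = len(pred_words & ref_words) / max(len(ref_words), 1)
        return reward


observation_list = [{"input": "explain how attention work in seq2seq model",
                     "ref": "attention lets the decoder focus on relevant encoder states"}]
```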
Parameters for sampling diverse examples:
```python
actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,  # if True, always pick the highest-probability token (no sampling)
                    temperature=1.0,              # sampling temperature
                    compare_sample=2,             # number of samples to rank
                    top_k=0,                      # top-k sampling
                    top_p=1.0)                    # top-p (nucleus) sampling
```
When training a reinforcement learning (RL) model, several key parameters need to be tuned to ensure optimal performance. Here is a list of important parameters and their descriptions:
1. **Update Interval**: This determines how often the RL agent updates its policy based on collected experiences. A smaller update interval means the agent learns more frequently from recent experiences, while a larger interval allows more experiences to accumulate before learning. In the example above, the update interval is set to 10.
```python
update_interval=10
```
2. **Minibatch Size**: The number of experiences sampled from the experience replay buffer to compute the gradient update. A larger minibatch size helps to stabilize learning and reduce variance, but at the cost of increased computational requirements.
```python
minibatch_size=2000
```
3. **Epochs**: The number of times the agent iterates through the entire minibatch to update its policy. More epochs can lead to better learning but may increase the risk of overfitting.
```python
epochs=20
```
4. **Discount Factor (Gamma)**: This parameter determines how much future rewards are discounted when calculating the expected return. A value closer to 1 makes the agent more farsighted, while a value closer to 0 makes the agent more focused on immediate rewards.
```python
gamma=0.99
```
5. **Learning Rate**: The step size used for updating the policy. A larger learning rate allows for faster convergence but may lead to instability in learning, while a smaller learning rate ensures stable learning at the cost of slower convergence.
```python
lr=1e-4
```
6. **Epsilon**: A parameter used in the PPO algorithm to clip the policy ratio. This helps to control the magnitude of policy updates, preventing excessively large updates that can destabilize learning.
```python
epsilon=0.2
```
7. **Entropy Coefficient**: This parameter encourages exploration by adding a bonus reward for taking less certain actions. A higher entropy coefficient promotes more exploration, while a lower coefficient focuses the agent on exploiting known strategies.
```python
entropy_coef=0.01
```
8. **Training Steps**: The total number of steps the agent takes during training. More steps typically lead to better learning but may require more computational time.
```python
steps=1000
```
9. **Evaluation Interval**: The number of training steps between evaluations. Increasing the evaluation interval reduces the computational time spent on evaluation, but it may also reduce the frequency at which you can monitor the agent's progress.
```python
eval_interval=10000
```
10. **Max Episode Length**: The maximum number of steps allowed in a single episode during training. This can prevent the agent from getting stuck in long, unproductive episodes.
```python
train_max_episode_len=50
```
These parameters need to be carefully tuned based on the specific problem and environment to achieve the best performance. It is generally recommended to start with default values and then adjust them based on the observed learning behavior.
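Putting the knobs above together, a minimal end-to-end configuration might look like the sketch below. Only `update_interval`, `minibatch_size`, `epochs`, and `lr` are shown being passed to `agent_ppo` (as in the FLAN-T5 example); gamma, the PPO clipping epsilon, and the entropy coefficient are PPO-level settings and are not configured here. `env`, `model`, and `tokenizer` are assumed to be set up as in the Usage section.

```python
from textrl import TextRLActor, train_agent_with_evaluation

actor = TextRLActor(env, model, tokenizer,
                    act_deterministically=False,
                    temperature=1.0,
                    top_k=0,
                    top_p=1.0)
agent = actor.agent_ppo(update_interval=10,   # learn every 10 collected steps
                        minibatch_size=2000,  # experiences per gradient update
                        epochs=20,            # passes over each minibatch
                        lr=1e-4)              # learning rate

train_agent_with_evaluation(
    agent,
    env,
    steps=1000,                # total training steps
    eval_n_steps=None,
    eval_n_episodes=1,
    train_max_episode_len=50,  # max steps per episode
    eval_interval=10000,       # steps between evaluations
    outdir='somewhere',
)
```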