# InstructLab Training Library





- [Installing](#installing-the-library)
  - [Additional NVIDIA packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about the training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
  - [`DeepSpeedOptions`](#deepspeedoptions)
  - [`FSDPOptions`](#fsdpoptions)
  - [`loraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

## Installing the library

To get started with the library, install it via `pip`:

```bash
pip install instructlab-training
```

Alternatively, to develop against the library, clone this repository and install it in editable mode:

```bash
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling.
If you are using NVIDIA hardware with CUDA, install the following additional dependencies.

Basic install:

```bash
pip install .[cuda]
```

Editable install (development):

```bash
pip install -e .[cuda]
```

## Using the library

You can utilize this training library by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions,
)
```

You can then define various training arguments, which will serve as the parameters for your training runs. See:

- [Learning about training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

| Field | Description |
| --- | --- |
| model_path | Either a reference to a HuggingFace repo or a path to a model saved in the HuggingFace format. |
| data_path | A path to the `.jsonl` training dataset. This is expected to be in the messages format. |
| ckpt_output_dir | Directory where trained model checkpoints will be saved. |
| data_output_dir | Directory where the processed training data is stored (post filtering/tokenization/masking). |
| max_seq_len | The maximum sequence length to be included in the training set. Samples exceeding this length will be dropped. |
| max_batch_len | Maximum tokens per GPU for each batch that will be handled in a single step. Used as part of the multipack calculation. If running into out-of-memory errors, try to lower this value, but not below `max_seq_len`. |
| num_epochs | Number of epochs to run through before stopping. |
| effective_batch_size | The number of samples in a batch the model should see before its parameters are updated. |
| save_samples | Number of samples the model should see before saving a checkpoint. Consider this to be the checkpoint save frequency. |
| learning_rate | How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
| warmup_steps | The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to `learning_rate`. |
| is_padding_free | Boolean value to indicate whether or not we're training a padding-free transformer model such as Granite. |
| random_seed | The random seed PyTorch will use. |
| mock_data | Whether or not to use mock, randomly generated data during training. For debug purposes. |
| mock_data_len | Max length of a single mock data sample. Equivalent to `max_seq_len` but for mock data. |
| deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
| lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
| chat_tmpl_path | Specifies the chat template / special tokens for training. |
| checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
| fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
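
As noted in the `data_path` row above, the dataset is expected to be in the messages format. A minimal sketch of what one `.jsonl` line might look like — the exact field names shown here are an assumption based on the common chat-messages convention, not confirmed by this README:

```python
import json

# Hypothetical example of a single messages-format sample; the exact
# schema InstructLab expects may differ.
sample = {
    "messages": [
        {"role": "user", "content": "What is the LAB method?"},
        {"role": "assistant", "content": "LAB is an alignment method built on synthetic data."},
    ]
}

# Each line of the .jsonl dataset is one JSON object like the above.
line = json.dumps(sample)
parsed = json.loads(line)
print(parsed["messages"][0]["role"])  # -> user
```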

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
| cpu_offload_optimizer_ratio | Floating point between 0 and 1. Specifies the ratio of parameters updating (i.e. optimizer step) on the CPU side. |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).
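
For orientation, the offload options above correspond roughly to fields in a standard DeepSpeed ZeRO stage 2 config. The sketch below shows that mapping; the surrounding config keys are DeepSpeed's own, but whether this library assembles them exactly this way internally is an assumption:

```python
# Sketch: roughly how the cpu_offload_optimizer* options map onto a
# standard DeepSpeed ZeRO-2 config dict. This is an assumption about
# the mapping, not this library's actual code.
def zero2_config(cpu_offload_optimizer: bool,
                 cpu_offload_optimizer_ratio: float = 1.0,
                 cpu_offload_optimizer_pin_memory: bool = False) -> dict:
    zero_opt = {"stage": 2}
    if cpu_offload_optimizer:
        zero_opt["offload_optimizer"] = {
            "device": "cpu",
            "pin_memory": cpu_offload_optimizer_pin_memory,
            "ratio": cpu_offload_optimizer_ratio,
        }
    return {"zero_optimization": zero_opt}

cfg = zero2_config(True, cpu_offload_optimizer_ratio=0.5)
```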

### `FSDPOptions`

As with DeepSpeed, only a subset of FSDP's parameters are exposed for you to modify.
They are listed below:

| Field | Description |
| --- | --- |
| cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
| sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are: `FULL_SHARD` (ZeRO-3), `HYBRID_SHARD` (ZeRO-3*), `SHARD_GRAD_OP` (ZeRO-2), and `NO_SHARD`. |

> [!NOTE]
> For `sharding_strategy`, only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.

### `loraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"`. |

#### Example run with LoRA options

If you'd like to do a LoRA train, you can specify a LoRA
option to `TrainingArgs` via the `LoraOptions` object.

```python
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ...
)
```
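
To build intuition for `rank` and `alpha`: in standard LoRA, the low-rank adapter update is scaled by `alpha / rank`, so the values in the example above give a scaling of 8. A pure-Python sketch of that update, with tiny illustrative matrix shapes not taken from this library:

```python
# Standard LoRA update: W' = W + (alpha / rank) * B @ A,
# where A is (rank x in_features) and B is (out_features x rank).
# Shapes below are tiny and purely illustrative.
def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

rank, alpha = 4, 32
scaling = alpha / rank  # 8.0 with the values from the example above

A = [[1.0] * 3 for _ in range(rank)]   # rank x in_features
B = [[0.0] * rank for _ in range(2)]   # out_features x rank (B starts at zero)
delta = [[scaling * x for x in row] for row in matmul(B, A)]
```

Initializing `B` to zeros (as in the standard LoRA recipe) makes the initial update a no-op, so training starts from the base model's behavior.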

## Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running on a single-GPU system or something that doesn't
otherwise require a distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=TrainingArgs(
        # ...
    ),
)
```

However, if you want to specify a more complex configuration,
the library currently supports all the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

### Example training run with `TorchrunArgs` arguments

For example, in an 8-GPU, 2-machine system, we would
specify the following torchrun config:

```python
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 1
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
```

```python
import os

MASTER_ADDR = os.getenv('MASTER_ADDR')
MASTER_PORT = os.getenv('MASTER_PORT')
RDZV_ENDPOINT = f'{MASTER_ADDR}:{MASTER_PORT}'

# on machine 2
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 4, # num GPUs per machine
    node_rank = 1, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = RDZV_ENDPOINT
)

run_training(
    torchrun_args=torchrun_args,
    training_args=training_args
)
```

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```
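
The `warmup_steps` behavior described earlier — the learning rate climbing linearly from 0 up to `learning_rate` — can be sketched as below. Whether this library applies any decay after warmup is not stated here, so this sketch only covers the ramp itself:

```python
# Linear warmup sketch: LR ramps from 0 to learning_rate over
# warmup_steps, then holds. Any post-warmup decay the library applies
# is intentionally not modeled.
def warmup_lr(step: int, learning_rate: float, warmup_steps: int) -> float:
    if step >= warmup_steps:
        return learning_rate
    return learning_rate * step / warmup_steps

lr_mid = warmup_lr(400, 2e-6, 800)   # halfway through warmup
lr_full = warmup_lr(800, 2e-6, 800)  # warmup complete
```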

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> For single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node = 1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```
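
Putting the two objects together: torchrun launches `nnodes * nproc_per_node` workers in total, and the `effective_batch_size` from `TrainingArgs` is reached across all of them via gradient accumulation. A sketch of that arithmetic — the per-GPU micro-batch size here is a made-up illustrative number, and how the library actually computes accumulation internally is an assumption:

```python
# Sketch: relation between torchrun worker count and effective batch size.
nnodes, nproc_per_node = 1, 8
world_size = nnodes * nproc_per_node  # total GPUs torchrun will launch

effective_batch_size = 3840           # from the TrainingArgs example above
samples_per_gpu_per_step = 16         # hypothetical micro-batch size

# Gradient accumulation steps needed before one optimizer update:
grad_accum = effective_batch_size // (world_size * samples_per_gpu_per_step)
```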

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```