sophia-opt


Name: sophia-opt
Version: 0.2.2
Home page: None
Summary: A community package of the official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
Upload time: 2024-04-17 14:40:18
Maintainer: None
Docs URL: None
Author: Liuhong99
Requires Python: >=3.8
License: None
Keywords: optimizer, pytorch, sophia
Requirements: No requirements were recorded.
            
# sophia-opt

[PyPI](https://pypi.org/project/sophia-opt/)

This package is a fork of the official implementation ([https://github.com/Liuhong99/Sophia](https://github.com/Liuhong99/Sophia)), repackaged to make installation easier.


### Installation
```sh
pip install sophia-opt
```
or `pip install git+https://github.com/fuyutarow/sophia-opt`

### Usage
```python
from sophia_opt import SophiaG
```
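
As a quick sanity check after installation, the optimizer can be constructed like any `torch.optim` optimizer. The snippet below is only a minimal sketch: the `torch.nn.Linear` model and random input are placeholders, and the hyper-parameter values are copied from the General Usage example further down.

```python
import torch
from sophia_opt import SophiaG

# throwaway model and data, purely to check that the optimizer constructs and steps
model = torch.nn.Linear(16, 4)
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99), rho=0.01, weight_decay=1e-1)

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
optimizer.step(bs=8)   # bs is the batch size (in tokens for LM training; see General Usage below)
optimizer.zero_grad(set_to_none=True)
```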


--- The following content is almost the same as the official README ---


# Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training


This is an official implementation of the **Sophia-G** optimizer in the paper [https://arxiv.org/abs/2305.14342](https://arxiv.org/abs/2305.14342) and GPT-2 training scripts. The code is based on [nanoGPT](https://github.com/karpathy/nanoGPT/) and [levanter](https://github.com/stanford-crfm/levanter/). Please cite the paper and star this repo if you find Sophia useful. Thanks!


```tex
@article{liu2023sophia,
 title={Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training},
 author={Liu, Hong and Li, Zhiyuan and Hall, David and Liang, Percy and Ma, Tengyu},
 journal={arXiv preprint arXiv:2305.14342},
 year={2023}
}
```


## News and Updates
- Updated results with the latest PyTorch version.




## Dependencies


- [PyTorch](https://pytorch.org) 2.1.2
- transformers 4.33.0
- datasets
- tiktoken
- wandb

## General Usage

Below is an example code snippet for training a general model with an NLL loss using SophiaG. Please refer to the next section for guidelines on hyperparameter tuning.

```python
import torch
import torch.nn.functional as F
from sophia_opt import SophiaG

# init the model, loss function, and input data
# (Model, block_size, and epochs are placeholders: supply your own model and data pipeline)
model = Model()
data_loader = ...

# init the optimizer
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99), rho=0.01, weight_decay=1e-1)

total_bs = len(data_loader)
bs = total_bs * block_size  # effective batch size in tokens, passed to optimizer.step
k = 10                      # re-estimate the Hessian EMA every k steps
iter_num = -1

# training loop
for epoch in range(epochs):
    for X, Y in data_loader:
        # standard training step
        logits, loss = model(X, Y)
        loss.backward()
        optimizer.step(bs=bs)
        optimizer.zero_grad(set_to_none=True)
        iter_num += 1

        # every k iterations, update the Hessian EMA using labels sampled from the model
        if iter_num % k == k - 1:
            logits, _ = model(X, None)
            samp_dist = torch.distributions.Categorical(logits=logits)
            y_sample = samp_dist.sample()
            loss_sampled = F.cross_entropy(logits.view(-1, logits.size(-1)), y_sample.view(-1), ignore_index=-1)
            loss_sampled.backward()
            optimizer.update_hessian()
            optimizer.zero_grad(set_to_none=True)
            model.zero_grad()
```


## Hyper-parameter Tuning

### Definition of learning rate 
- The update in the code is written as $\theta_{t+1} = \theta_t - \mathrm{lr} \cdot \textup{clip}(m_t / (\rho \cdot h_t + \epsilon), 1)$, which is equivalent to the update in the paper up to a re-parameterization (the $\mathrm{lr}$ here corresponds to $\rho \cdot \eta_t$ in the paper); a minimal sketch of this update is given below. As a result, the learning rates of AdamW and Lion are not directly comparable with each other; empirically, Adam and Lion with a learning rate ratio of 5:1 have similar behaviour. The learning rates of SophiaG and Lion are directly comparable. Sophia allows a much larger learning rate than Lion, and this is why Sophia is much faster.
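
For concreteness, here is a minimal sketch of this per-coordinate update rule (not the package's internal implementation, which additionally maintains the momentum and Hessian EMAs and applies weight decay):

```python
import torch

def sophia_update(theta, m, h, lr=2e-4, rho=0.01, eps=1e-15):
    """theta <- theta - lr * clip(m / (rho * h + eps), 1),
    where clip(., 1) clamps each coordinate to [-1, 1]."""
    ratio = m / (rho * h + eps)
    return theta - lr * torch.clamp(ratio, min=-1.0, max=1.0)
```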

### Tuning the hyperparameter $\rho$ 
- Tune $\rho$ to make the proportion of the clipped coordinates stable and in a proper range. This is tracked as ```train/win_rate``` in the [GPT-2 training example](https://github.com/Liuhong99/Sophia/blob/2443b03529ecdccf65699a5e55e68d69ede39509/train_sophiag.py#L398C21-L398C65). ```train/win_rate``` should peak at the beginning and remain stable afterwards, staying in the range of 0.1 - 0.5. Typically a larger $\rho$ will lead to a larger ```train/win_rate```. An example of typical ```win_rate``` behavior in a T5 model is provided below.
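
As a rough illustration only, a win-rate-style metric, i.e. the fraction of coordinates whose update is not clipped (see the GPT-2 section below), could be sketched as follows; the actual ```train/win_rate``` logged by the linked script may be computed differently:

```python
import torch

def win_rate(m, h, rho, eps=1e-15):
    # fraction of coordinates where |m / (rho * h + eps)| < 1, i.e. the update is not clipped
    ratio = (m / (rho * h + eps)).abs()
    return (ratio < 1.0).float().mean().item()
```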

### Tuning the learning rate and weight decay
- Choose lr to be slightly smaller than the learning rate that you would use for AdamW or 3 - 5 times the learning rate that you would use for Lion. 
<p align="center" width="100%">
      <img src="assets/t5_winrate.png" style="width: 60%; min-width: 200px; display: block; margin: auto;">
</p>

- If the loss blows up, slightly decrease the learning rate or increase $\rho$.
  
- Always use about 2x larger weight decay than what you would use for AdamW.

### Hyperparameters for GPT-2 models

- Choose lr to be about the same as the learning rate that you would use for AdamW or 5 - 10 times the learning rate that you would use for Lion.
- Tune $\rho$ to make the proportion of the parameters where the update is not clipped stable and in a proper range. This is tracked as ```train/win_rate``` in the [GPT-2 training example](https://github.com/Liuhong99/Sophia/blob/2443b03529ecdccf65699a5e55e68d69ede39509/train_sophiag.py#L398C21-L398C65). ```train/win_rate``` should peak in the beginning and remain stable afterwards. ```train/win_rate``` should stay in the range of 0.1 - 0.5. Typically a large $\rho$ will lead to a large ```train/win_rate```.
- Use slightly larger weight decay than AdamW, e.g. 0.2.
- Except for the learning rate, all other hyperparameters are transferable across different model sizes.
- See the table below for the hyperparameters for different model sizes.

| Model Size  | lr for Adam | lr for Lion | lr for Sophia | $\rho$ for Sophia | weight decay for Sophia |
| -------- | ------- | ------- | ------- | ------- | ------- |
| 125M | 6e-4 | 1e-4 | 6e-4 | 0.05 | 0.2 |
| 355M | 3e-4 | 1e-4 | 7e-4 | 0.08 | 0.2 |
| 770M | 2e-4 | 8e-5 | 3e-4 | 0.05 | 0.2 |

- Please feel free to let us know what you find out during hyper-parameter tuning. We appreciate your valuable feedback and comments!
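
For example, the 125M row of the table above translates into roughly the following optimizer construction (a sketch only: `model` is a placeholder, and the betas are taken from the General Usage example rather than from the table):

```python
import torch
from sophia_opt import SophiaG

model = torch.nn.Linear(768, 768)   # placeholder; use your GPT-2 125M model here

# lr, rho, and weight decay from the 125M row of the table above;
# betas follow the General Usage example (they are not listed in the table).
optimizer = SophiaG(model.parameters(), lr=6e-4, betas=(0.965, 0.99), rho=0.05, weight_decay=0.2)
```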

## Reproduce GPT-2 Results

Prepare the [OpenWebText](https://huggingface.co/datasets/openwebtext) data following [nanoGPT](https://github.com/karpathy/nanoGPT/):
```
$ python data/openwebtext/prepare.py
```
Start pre-training GPT2 Small (125M):

If you have a machine with 10 A5000 (24GB) GPUs,
```
$ torchrun --standalone --nproc_per_node=10 \
      train_sophiag.py \
      config/train_gpt2_small_sophiag.py \
      --batch_size=8 \
      --gradient_accumulation_steps=6
```
If you have a machine with 8 A100 (40GB) GPUs,
```
$ torchrun --standalone --nproc_per_node=8 \
      train_sophiag.py \
      config/train_gpt2_small_sophiag.py \
      --batch_size=12 \
      --gradient_accumulation_steps=5
```

To reproduce the AdamW baseline following [nanoGPT](https://github.com/karpathy/nanoGPT/):
```
$ torchrun --standalone --nproc_per_node=10 \
      train_adam.py \
      config/train_gpt2_small_adam.py \
      --batch_size=8 \
      --gradient_accumulation_steps=6
```

This will lead to results in the figure below:
<p align="center" width="100%">
      <img src="assets/small_100k_plus.png" style="width: 60%; min-width: 200px; display: block; margin: auto;">
</p>

Start pre-training GPT2 Medium (355M):

If you have a machine with 8 A100 (40GB) GPUs,
```
$ torchrun --standalone --nproc_per_node=8 \
      train_sophiag.py \
      config/train_gpt2_medium_sophiag.py \
      --batch_size=6 \
      --gradient_accumulation_steps=10
```

To reproduce the AdamW baseline:
```
$ torchrun --standalone --nproc_per_node=8 \
      train_adam.py \
      config/train_gpt2_medium_adam.py \
      --batch_size=6 \
      --gradient_accumulation_steps=10
```

Please adjust ```nproc_per_node```, ```batch_size```, and ```gradient_accumulation_steps``` accordingly if you use a different hardware setup. Make sure their product equals 480 (a quick check is sketched below).
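
For reference, the example commands above already satisfy this constraint:

```python
# nproc_per_node * batch_size * gradient_accumulation_steps should equal 480:
assert 10 * 8 * 6 == 480   # GPT-2 Small, 10x A5000
assert 8 * 12 * 5 == 480   # GPT-2 Small, 8x A100
assert 8 * 6 * 10 == 480   # GPT-2 Medium, 8x A100
```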


This will lead to results in the figure below:
<p align="center" width="100%">
      <img src="assets/medium_100k_plus.png" style="width: 60%; min-width: 200px; display: block; margin: auto;">
</p>

Start pre-training GPT2 1.5B:

We use [the Pile](https://github.com/EleutherAI/the-pile) and the GPT-NeoX tokenizer. First set up TPU instances and the environment following [levanter](https://github.com/stanford-crfm/levanter/blob/e183ec80ec5971b12d4a3fb08a160268de342670/docs/Getting-Started-TPU-VM.md). Then change GAMMA_SOPHIA_G to 200 in [optim.py](https://github.com/stanford-crfm/levanter/blob/e183ec80ec5971b12d4a3fb08a160268de342670/src/levanter/optim.py). The training script for the 1.5B model is
```
gcloud compute tpus tpu-vm ssh <instance_name> \
      --zone <zone_name> \
      --worker=all \
      --command 'WANDB_API_KEY=<wandb_api_key> levanter/infra/launch.sh python levanter/examples/gpt2_example.py --config_path levanter/config/gpt2_1536_pile.yaml --trainer.beta1 0.965 --trainer.beta2 0.99 --trainer.min_lr_ratio 0.020 --trainer.weight_decay 0.15 --trainer.learning_rate 2.5e-4 --trainer.warmup_ratio 0.01'

```

## Acknowledgement

The GPT-2 training code is based on [nanoGPT](https://github.com/karpathy/nanoGPT/), which is elegant and super efficient.

            
