sgd-boost

Name: sgd-boost
Version: 1.0.2
Home page: https://github.com/AnonymousAlethiometer/SGD_Boost
Summary: SGD-Boost optimizer implementation, designed specifically for PyTorch.
Upload time: 2024-11-28 21:07:17
Author: Anonymous
Requires Python: >=3.6.0
License: Apache Software License
Requirements: none recorded
            
# SGD_Boost

This repository contains the official PyTorch implementation of the paper <em>"Why Transformers Don’t Need Adam: Scale Is All You Need"</em> ([arXiv](https://arxiv.org/abs/)), providing a runtime- and memory-efficient optimizer for training large language models and diffusion models, named **"SGD-Boost"**.

Adaptive gradient methods like Adam and AdamW are popular for training transformer-based models due to their strong performance, but they have significant memory overhead and require extensive hyperparameter tuning. In this work, we argue that these adaptive methods are not necessary for effective training.

SGD-Boost introduces a learning rate rescaling strategy based on initial gradient patterns, applied to stochastic gradient descent with momentum (SGDM). This simple modification allows SGDM to achieve performance levels comparable to AdamW while reducing memory consumption and execution time. By removing the need to store second-order momentum terms, our approach reduces optimizer state memory by half, providing a “free lunch” in training efficiency.

Our method also enhances robustness to variations in learning rate and weight decay during ViT training on the ImageNet-1K task. Experimental results show that it outperforms existing optimizers in LoRA training for both large language models (LLMs) and diffusion models (DMs). Specifically, it enables full-precision (FP32) training of GPT-2 (1.5B parameters) on a single RTX 3090 and of Llama2-7B on an A100-80G GPU. Code is now available at [GitHub](https://github.com/AnonymousAlethiometer/SGD_Boost).
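Concretely, the idea can be sketched as follows. This is only an illustrative reading of the scaling strategy, not the package's internal implementation: the per-tensor g-SNR statistic shown here (mean gradient magnitude over the element-wise standard deviation) and the way it rescales the base learning rate are simplifying assumptions; the exact formulation is given in the paper and in the pseudocode of Figure 3 below.

```python
import torch

@torch.no_grad()
def gsnr(param):
    """Illustrative per-tensor g-SNR: mean |g| over the element-wise std of g.
    The exact statistic used by SGD-Boost is defined in the paper."""
    g = param.grad
    return (g.abs().mean() / (g.std() + 1e-8)).item()

# Hypothetical one-off rescaling after the first backward pass: each parameter
# keeps plain SGD-with-momentum updates, but its fixed learning rate is scaled
# by the g-SNR measured at initialization.
# scales = {name: gsnr(p) for name, p in model.named_parameters() if p.grad is not None}
# effective_lr = {name: base_lr * s for name, s in scales.items()}
```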


<!-- [Overview](./figures/overview.svg) -->
<p align='center'>
    <img src="./figures/overview.svg" height='300px'/>
    <br/>
    <em><b>Figure 1:</b> We analyze four key parameters: the weights of the Query, Key, and Value (QKV) in the first Transformer block; the normalization layer; the fully connected layers within that block; and the final MLP head layer. The gradient signal-to-noise ratio (g-SNR) differs across various parameter groups but remains stable throughout the training process. We utilize this signal to create a scaling strategy that adjusts the fixed learning rates in Stochastic Gradient Descent (SGD).</em>
</p>
<p align='center'>
    <img src="./figures/optimizer_memory_comparison.svg" height='300px'/>
    <img src="./figures/3d_scatter.svg" height='300px'/>
    <br/>
    <em>
    <b>Figure 2:</b>
    <b>Left:</b> The figure shows the significant memory overhead of optimizer states as model size increases. SGD-Boost maintains a much lower memory usage than other optimizers.
    <b>Right:</b> This figure displays the results of a grid search conducted on the classic ResNet18 model using the CIFAR10 dataset. The maximum top-1 test accuracy is highlighted in red text. Our method surpasses other popular optimizers, achieving the highest test accuracy.
    </em>
</p>
<p align='center'>
    <img src="./figures/algorithm_pseudocode.png" height='300px'/>
    <br/>
    <em>
    <b>Figure 3:</b> The pseudocode of the SGD-Boost optimizer.
    </em>
</p>

<!-- [Memory Comparison](./figures/optimizer_memory_comparison.svg) -->
<!-- <img src="./figures/optimizer_memory_comparison.svg"/> -->
<!-- [Stability & Perfomance with Hyperparameters Changes](./figures/3d_scatter.svg) -->
<!-- <img src="./figures/3d_scatter.svg" /> -->

<!-- <img src='./figures/algorithm_pseudocode.png'/> -->

## How To Use

### Installation
Prerequisites:
- Python >= 3.6
- PyTorch >= 1.7.0

Since the optimizer mostly relies on native PyTorch APIs, it should be compatible with a wide range of PyTorch versions. However, we recommend PyTorch 2.x for better performance and compatibility.
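As a quick sanity check of the prerequisites above (purely a convenience snippet, not required by the package):

```python
import sys
import torch

# Prerequisites from this README: Python >= 3.6, PyTorch >= 1.7.0 (2.x recommended).
assert sys.version_info >= (3, 6), "Python 3.6 or newer is required"
print("Python :", sys.version.split()[0])
print("PyTorch:", torch.__version__)
```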

Install from PyPI:
```bash
pip install sgd-boost
```

Install from the source code:
```bash
git clone https://github.com/AnonymousAlethiometer/SGD_Boost.git

cd SGD_Boost

# on older pip versions, the flag "--use-feature=in-tree-build" avoids the
# "FutureWarning: The 'build' command is deprecated" warning; recent pip versions
# build in-tree by default, so a plain `pip install .` is sufficient
pip install . --use-feature=in-tree-build

# [Optional] Or use the '-e' flag to install in editable mode
pip install -e . --use-feature=in-tree-build
```


### Usage of the optimizer

The optimizer is normally used in the following way:

```python
from sgd_boost import SGDBoost

# initialize the optimizer
optimizer = SGDBoost(model.parameters(), lr=lr, momentum=0.9, eps=1e-08, weight_decay=weight_decay)

for _ in range(steps):
    pred = model(input_ids)
    loss = loss_fn(pred, labels)
    # calculate the gradient
    loss.backward()
    # perform the warm-up step; it is only needed once, after the first gradient is computed
    if not hasattr(optimizer, 'has_warmup') and hasattr(optimizer, 'warmup_step'):
        optimizer.warmup_step()
        optimizer.has_warmup = True
    # update the parameters
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```
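If you would rather not track the one-off warm-up with an ad-hoc attribute, the same logic can be wrapped in a small helper. This is a convenience sketch built only around the `warmup_step()` call documented above; `warmup_once` itself is not part of the package API:

```python
def warmup_once(optimizer):
    """Run the one-time g-SNR warm-up after the first backward pass.
    Safe to call on every step; warmup_step() is only triggered once."""
    if getattr(optimizer, "_warmup_done", False):
        return
    if hasattr(optimizer, "warmup_step"):
        optimizer.warmup_step()
    optimizer._warmup_done = True

# In the training loop, right after loss.backward():
#     warmup_once(optimizer)
#     optimizer.step()
#     optimizer.zero_grad(set_to_none=True)
```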

### Distributed Training

For distributed training, you need to ensure that the g-SNR calculation (referred to as `.warmup_step()`) is performed on each worker. Even if you accidentally perform it multiple times, the final result is unaffected, thanks to the stability of the g-SNR values. Feel free to use the optimizer in your own training scripts.

In most circumstances, you only need to replace your original optimizer with ours, call `.warmup_step()` after the first gradient calculation (i.e. the first effective invocation of `loss.backward()`), and keep the rest of the code unchanged, as sketched below.
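A rough sketch of what this looks like under `torch.nn.parallel.DistributedDataParallel`. The process-group setup and `build_model`, `loader`, and `loss_fn` are placeholders; only the optimizer calls follow the usage described above.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from sgd_boost import SGDBoost

dist.init_process_group(backend="nccl")               # e.g. launched via torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
model = DDP(build_model().cuda())                     # build_model() is a placeholder
optimizer = SGDBoost(model.parameters(), lr=1e-3, momentum=0.9,
                     eps=1e-8, weight_decay=1e-2)

for step, (inputs, labels) in enumerate(loader):      # loader is a placeholder
    loss = loss_fn(model(inputs.cuda()), labels.cuda())  # loss_fn is a placeholder
    loss.backward()                                   # DDP has already all-reduced the gradients here
    if step == 0:
        optimizer.warmup_step()                       # g-SNR warm-up: run once on every worker
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```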


## Examples

The CNN examples live in the `examples` directory. It contains the training code for the CNN models, as well as the profiling code for evaluating optimizer performance.

The README in that directory will guide you through restoring the environment. Due to the anonymization process, although the main parts are left unchanged, some components may not be available; delete the unavailable resources as needed.

The ViT example will be released soon.


## Acknowledgement
1. The codebase builds on the [timm:pytorch-image-models](https://github.com/huggingface/pytorch-image-models) (ViT training example, to be released soon), [NanoGPT](https://github.com/karpathy/nanoGPT) and [Adam-mini](https://github.com/zyushun/Adam-mini) (GPT-2 training) repositories.

2. We thank the [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) for providing an accurate and efficient way to profile the memory usage of the optimizer.


## Citation
If you find this work helpful, please consider citing our paper:
```
@article{XXXXXXXXXX,
  title={Why Transformers Don’t Need Adam: Scale Is All You Need},
  author={Anonymous},
  journal={arXiv preprint arXiv:24XX.XXXXX},
  year={2024}
}
```



            
