came-pytorch


Name: came-pytorch
Version: 0.1.3
Home page: https://github.com/yangluo7/CAME/
Summary: CAME Optimizer - Pytorch Version
Upload time: 2024-02-05 13:49:57
Author: Yang Luo
Requires Python: >=3.6
License: MIT
Keywords: artificial intelligence, deep learning, optimizers, memory efficient
Requirements: none recorded
            <h1 align="center">CAME Optimizer</h1>
<h3 align="center">ACL 2023 Outstanding Paper Award<br/>Confidence-guided Adaptive Memory Efficient Optimization</h3>


This is an official implementation of the **CAME** optimizer from the paper "[Confidence-guided Adaptive Memory Efficient Optimization](https://arxiv.org/abs/2307.02047)". Please cite the paper and star this repo if you find CAME useful. Thanks!

[Paper](https://arxiv.org/abs/2307.02047) | [Twitter](https://twitter.com/ZangweiZheng/status/1680227732788236289) | [Blog](https://zhengzangw.github.io/blogs/came) | [Pypi Package](https://pypi.org/project/came-pytorch/) | [zhihu](https://zhuanlan.zhihu.com/p/643816029)
## Method

In this work, we studied a confidence-guided strategy to reduce the instability of existing memory-efficient optimizers. Based on this strategy, we proposed CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.

The pseudocode is presented in the figure below, with the differences from Adafactor highlighted in blue.

<p align="center">
<img src="assets/came_code.png" alt="CAME optimizer pseudo code" width="50%" />
</p>
<!-- ![CAME_code](assets/came_code.png) -->

## Install 
```
pip install came-pytorch
```
## Usage

```python
from came_pytorch import CAME
optimizer = CAME(
    model.parameters(),
    lr=2e-4,
    weight_decay=1e-2,
    betas=(0.9, 0.999, 0.9999),
    eps=(1e-30, 1e-16)
)
```
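
CAME follows the standard `torch.optim.Optimizer` interface, so it drops into an ordinary training loop. Below is a minimal, self-contained sketch with a toy model and random data (both purely illustrative, not part of the package):

```python
import torch
import torch.nn as nn
from came_pytorch import CAME

# Toy model and random data, purely for illustration.
model = nn.Linear(128, 10)
optimizer = CAME(
    model.parameters(),
    lr=2e-4,
    weight_decay=1e-2,
    betas=(0.9, 0.999, 0.9999),  # (beta1, beta2, beta3); beta3 is CAME's extra decay for the confidence statistic
    eps=(1e-30, 1e-16),
)

inputs = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))

for step in range(10):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()  # standard Optimizer interface: zero_grad / backward / step
```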

## Hyper-parameter Tuning

* Pre-training: based on our experiments on BERT-Large, GPT-2 and T5, a suitable learning rate for CAME is 1-3x smaller than the one used for AdamW.
* Consider choosing $\beta_3$ in $[0.9995, 0.99995]$ when setting $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Due to computational resource constraints, we did not explore more combinations of the three betas; different training tasks may require different combinations for optimal performance. A configuration sketch following these suggestions appears after this list.
* If you have any feedback or comments regarding hyper-parameter tuning, please do not hesitate to share them with us!
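
Putting these suggestions together, a hedged configuration sketch (the AdamW baseline learning rate of 6e-4 and the `model` variable are assumptions for illustration):

```python
from came_pytorch import CAME

adamw_lr = 6e-4              # hypothetical learning rate of the AdamW baseline
came_lr = adamw_lr / 3       # pick something 1-3x smaller for CAME

optimizer = CAME(
    model.parameters(),      # `model` is assumed to be defined elsewhere
    lr=came_lr,
    weight_decay=1e-2,
    betas=(0.9, 0.999, 0.9998),  # beta3 chosen inside the suggested [0.9995, 0.99995] range
    eps=(1e-30, 1e-16),
)
```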

## Experiments

Apart from the BERT and T5 experiments reported in the paper, we have conducted additional experiments and record the results here.

### Fine-tuning LLaMA-7B

|                | MMLU      | WikiText | HellaSwag | TruthfulQA (MC) | BoolQ     | COPA      | WSC       | WIC       |
| -------------- | --------- | -------- | --------- | --------------- | --------- | --------- | --------- | --------- |
| Alpaca-7B      | 40.21     | 6.74     | 59.76     | **38.89**       | **79.57** | **88.00** | 46.15     | 49.84     |
| Alpaca-7B-CAME | **40.59** | **6.38** | **59.80** | 38.61           | 79.08     | **88.00** | **49.04** | **50.78** |

We fine-tuned LLaMA-7B with [stanford-alpaca](https://github.com/tatsu-lab/stanford_alpaca) (a 52k-example instruction-tuning dataset). To replicate our result, first register the CAME optimizer with the `transformers` package, then change the default optimizer in the Alpaca training script from "adamw" to "came".
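
The registration step itself is not shown in this README. If you prefer not to patch `transformers`, one hedged alternative (not the authors' exact recipe) is to hand a CAME instance directly to the Hugging Face `Trainer` via its `optimizers` argument; `model`, `train_dataset`, `data_collator` and the learning rate below are placeholders standing in for the Alpaca setup:

```python
from transformers import Trainer, TrainingArguments
from came_pytorch import CAME

# `model`, `train_dataset` and `data_collator` are assumed to come from the Alpaca setup.
optimizer = CAME(
    model.parameters(),
    lr=2e-5,                          # illustrative; tune relative to the AdamW fine-tuning rate
    weight_decay=0.0,
    betas=(0.9, 0.999, 0.9999),
    eps=(1e-30, 1e-16),
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpaca-7b-came", num_train_epochs=3),
    train_dataset=train_dataset,
    data_collator=data_collator,
    optimizers=(optimizer, None),     # Trainer builds its default LR scheduler when the second slot is None
)
trainer.train()
```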

Alpaca-7B and Alpaca-7B-CAME are evaluated using [Instruct-eval](https://github.com/declare-lab/instruct-eval) and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

### Pre-training GPT-2

![CAME_gpt2](assets/gpt-2_came.png)

The pre-training of GPT-2 (Medium, 345M) is based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM). To replicate our result, add the CAME optimizer in [`megatron/optimizer/__init__.py`](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/optimizer/__init__.py) and set *args.optimizer* to "came".

## Memory Usage Comparison
To ensure a fair comparison, we set the batch size to 1 for the pre-training of GPT-2 (Medium) to examine the memory footprint of CAME and AdamW.

|              | AdamW | CAME     | 
|--------------|-------|----------|
| Memory (GiB) | 8.77  | **7.44** | 
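
To run a similar comparison on your own setup, peak allocator statistics from `torch.cuda` can be sampled around a single training step. A rough, single-GPU sketch (the model, batch and `.loss` interface are placeholders):

```python
import torch

def peak_memory_gib(model, batch, make_optimizer):
    """Run one training step and return peak CUDA memory in GiB (rough, single-GPU measurement)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    optimizer = make_optimizer(model.parameters())
    loss = model(**batch).loss        # assumes a Hugging Face-style model that returns .loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 2**30

# Example comparison (CAME imported from came_pytorch):
#   peak_memory_gib(model, batch, lambda p: torch.optim.AdamW(p, lr=2e-4))
#   peak_memory_gib(model, batch, lambda p: CAME(p, lr=2e-4))
```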

## Citation

```bibtex
@inproceedings{luo2023came,
  title={CAME: Confidence-guided Adaptive Memory Efficient Optimization},
  author={Luo, Yang and Ren, Xiaozhe and Zheng, Zangwei and Jiang, Zhuo and Jiang, Xin and You, Yang},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={4442--4453},
  year={2023}
}
```

            
