dadaptation

Name: dadaptation
Version: 3.2
Summary: Learning Rate Free Learning for Adam, SGD and AdaGrad
Home page: https://github.com/facebookresearch/dadaptation
Author: Aaron Defazio
Requires Python: >=3.6
Requirements: torch (>=1.5.1)
Upload time: 2023-11-27 16:44:15

# D-Adaptation
[![Downloads](https://static.pepy.tech/badge/dadaptation)](https://pepy.tech/project/dadaptation) [![Downloads](https://static.pepy.tech/badge/dadaptation/month)](https://pepy.tech/project/dadaptation)

Learning rate free learning for SGD, AdaGrad and Adam! 

*by Aaron Defazio and Konstantin Mishchenko [(Arxiv)](https://arxiv.org/abs/2301.07733)*

``` pip install dadaptation ```

**The new v3.0 release uses an improved algorithm that may give different results from past versions. The old version is still available under `experimental/d_adapt_adam_preprint`.**
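
If you need to keep reproducing preprint-era results, a minimal sketch of loading the old variant is below. The module path comes from the note above; the class name `DAdaptAdamPreprint` is an assumption, so check the module if the import fails:

```python
import torch
# Module path stated above; the class name DAdaptAdamPreprint is assumed.
from dadaptation.experimental.d_adapt_adam_preprint import DAdaptAdamPreprint

model = torch.nn.Linear(32, 10)          # placeholder model
optimizer = DAdaptAdamPreprint(model.parameters(), lr=1.0)  # keep lr at 1.0
```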

## NEW: Prodigy
We have recently released the [Prodigy](https://github.com/konstmish/prodigy) method, which grows the adapted learning rate faster than D-Adaptation in both theory and practice. Try it out if D-Adaptation is underestimating the learning rate.

## How To Cite
If you use D-Adaptation in a publication, please cite our work as 
```
@ARTICLE{defazio2023dadapt,
    author  = {Aaron Defazio and Konstantin Mishchenko},
    title   = {Learning-Rate-Free Learning by D-Adaptation},
    journal = {The 40th International Conference on Machine Learning (ICML 2023)},
    year    = {2023}
}
```

## Details

The provided PyTorch optimizer classes are drop-in replacements: either copy them into your project or install via pip and use `dadaptation.DAdaptSGD`, `dadaptation.DAdaptAdam`, or `dadaptation.DAdaptAdaGrad`. Usage sketches follow the list below.

 - **Set the LR parameter to 1.0**. This parameter is not ignored. Setting it larger or smaller will directly scale the D-Adapted learning rate estimate up or down.
 - Different per-layer learning rates can be achieved by setting the `layer_scale` value in each parameter group. It defaults to 1.0 and scales each layer's learning rate relative to the other layers.
 - **Use the same learning rate scheduler you would normally use on the problem.**
 - The Adam variant supports AdamW-style weight decay; just set `decouple=True`. It is not enabled by default, so if you are replacing an existing AdamW setup, make sure to turn on decoupled weight decay.
 - It may be necessary to use larger weight decay than you normally would; try a factor of 2 or 4 larger if you see overfitting. D-Adaptation uses larger learning rates than people typically hand-pick, and in some cases that requires more decay.
 - Use the `log_every` setting to see the learning rate being used (`d*lr`) and the current D bound.
 - Only the AdaGrad version supports sparse gradients. It does not adapt as efficiently as the other variants and should be considered experimental.
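
A minimal sketch putting these points together. The model, data, and schedule are placeholders, the `weight_decay` value and `layer_scale` factor are arbitrary, and per-group `layer_scale` is only available from v3.2 onward:

```python
import logging
import torch
import dadaptation

logging.basicConfig(level=logging.INFO)  # log_every output goes through Python logging

model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

# Keep lr at 1.0: D-Adaptation estimates the actual step size (d * lr).
# layer_scale rescales a parameter group's learning rate relative to the others.
optimizer = dadaptation.DAdaptAdam(
    [
        {"params": model[0].parameters()},
        {"params": model[2].parameters(), "layer_scale": 0.5},
    ],
    lr=1.0,
    weight_decay=0.01,
    decouple=True,   # AdamW-style decoupled weight decay
    log_every=100,   # periodically log d*lr and the current D bound
)

# Use the same scheduler you would normally use on the problem.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The sparse-gradient path from the last bullet, as a separate sketch; the embedding table and classifier head are placeholders, and it assumes mixed dense and sparse parameters can share one optimizer:

```python
import torch
import dadaptation

# Only the AdaGrad variant supports sparse gradients (and it is considered experimental).
embedding = torch.nn.EmbeddingBag(10_000, 64, sparse=True)  # sparse=True yields sparse grads
head = torch.nn.Linear(64, 2)

optimizer = dadaptation.DAdaptAdaGrad(
    list(embedding.parameters()) + list(head.parameters()),
    lr=1.0,
)

tokens = torch.randint(0, 10_000, (8, 20))
labels = torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(head(embedding(tokens)), labels)
optimizer.zero_grad()
loss.backward()   # embedding.weight.grad is a sparse tensor
optimizer.step()
```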
 
## Change Log

### Version 3.2
 - Added support for layer-wise scaling to DAdaptAdam.

### Version 3.0
 - Major improvements to DAdaptAdam, particularly improving performance on Transformer models. This variant may behave differently in practice. The old version is available under `experimental/d_adapt_adam_preprint` if you wish to continue using it.
 - The IP variant is now the main variant of the method.
 - Added Lion. This is highly experimental. Feedback on its performance is welcome.

### Version 2.0
 - Added Adan - should still be considered experimental.
 - Added support for PyTorch's Fully Sharded Data Parallel. 
 - Improved support of edge cases such as learning rate zero.
 - Improved logging - uses Python logging rather than print statements

# Experimental results

![CIFAR-10](figures/dadapt_cifar.png)
![CIFAR-100](figures/dadapt_cifar100.png)
![ImageNet](figures/dadapt_imagenet.png)
![ViT](figures/dadapt_vit.png)
![LSTM](figures/dadapt_lstm.png)
![RoBERTa](figures/dadapt_roberta.png)
![GPT](figures/dadapt_gpt.png)
![fastMRI](figures/dadapt_fastmri.png)
![Detectron](figures/dadapt_detectron.png)
![DLRM](figures/dadapt_dlrm.png)

# License
See the [License file](/LICENSE).

            
