# D-Adaptation
Learning-rate-free learning for SGD, AdaGrad, and Adam!
*by Aaron Defazio and Konstantin Mishchenko [(Arxiv)](https://arxiv.org/abs/2301.07733)*
```
pip install dadaptation
```
**NEW: The V3.0 release uses an improved algorithm that may give different results from past versions. The old version is still available under `experimental/d_adapt_adam_preprint`.**
## NEW: Prodigy
We have recently released the [Prodigy](https://github.com/konstmish/prodigy) method, which grows the adapted learning rate faster than D-Adaptation in both theory and practice. Try it out if D-Adaptation is underestimating the learning rate.
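For reference, the swap is minimal. Below is a hypothetical sketch, assuming the `prodigyopt` package and `Prodigy` class from the linked repository; check that repository's README for authoritative usage.

```python
# Hypothetical sketch: trying Prodigy in place of a D-Adaptation optimizer.
# The package name (prodigyopt) and class (Prodigy) are taken from the linked repository.
import torch
from prodigyopt import Prodigy  # pip install prodigyopt

model = torch.nn.Linear(10, 1)                   # stand-in model
optimizer = Prodigy(model.parameters(), lr=1.0)  # as with D-Adaptation, keep lr at 1.0
```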
## How To Cite
If you use D-Adaptation in a publication, please cite our work as
```
@ARTICLE{defazio2023dadapt,
  author  = {Aaron Defazio and Konstantin Mishchenko},
  title   = {Learning-Rate-Free Learning by D-Adaptation},
  journal = {The 40th International Conference on Machine Learning (ICML 2023)},
  year    = {2023}
}
```
## Details
The provided PyTorch optimizer classes are drop-in replacements: either copy them into your project or install via pip and use `dadaptation.DAdaptSGD`, `dadaptation.DAdaptAdam`, or `dadaptation.DAdaptAdaGrad`. A usage sketch follows the list below.
 - **Set the `lr` parameter to 1.0**. This parameter is not ignored: setting it larger or smaller directly scales the D-Adapted learning rate estimate up or down.
 - Different per-layer learning rates can be achieved by setting the `layer_scale` value in each parameter group. It defaults to 1.0 and scales each layer's learning rate relative to the other layers.
- **Use the same learning rate scheduler you would normally use on the problem.**
 - The Adam variant supports AdamW-style weight decay: just set `decouple=True`. It is not turned on by default, so if you are replacing an existing AdamW setup, make sure to pass `decouple=True`.
 - It may be necessary to use a larger weight decay than you normally would; try a factor of 2 to 4 larger if you see overfitting. D-Adaptation uses larger learning rates than people typically hand-choose, which in some cases requires more decay.
 - Use the `log_every` setting to see the learning rate being used (`d*lr`) and the current D bound.
- Only the AdaGrad version supports sparse gradients. It does not adapt as efficiently as the other variants and should be considered experimental.
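The sketch below ties the points above together in a standard PyTorch training loop. The model, data, scheduler choice, and the specific values for `weight_decay`, `log_every`, and `layer_scale` are placeholders for illustration, not recommendations.

```python
import torch
import dadaptation

# Stand-in model and data; replace these with your own.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
loss_fn = torch.nn.CrossEntropyLoss()
data = [(torch.randn(32, 784), torch.randint(0, 10, (32,))) for _ in range(10)]

# Drop-in replacement for Adam. The per-group layer_scale gives that group a
# learning rate relative to the others (the 0.1 here is purely illustrative).
optimizer = dadaptation.DAdaptAdam(
    [
        {"params": model[0].parameters()},
        {"params": model[2].parameters(), "layer_scale": 0.1},
    ],
    lr=1.0,             # keep at 1.0; larger/smaller values scale the adapted estimate
    weight_decay=0.02,  # consider 2-4x your usual decay if you see overfitting
    decouple=True,      # AdamW-style decoupled weight decay
    log_every=5,        # periodically log the learning rate in use (d*lr) and the D bound
)

# Use the same learning rate scheduler you would normally use on the problem.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(data))

for x, y in data:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
    scheduler.step()
```

With `log_every` set, the adapted learning rate is reported through Python's `logging` module (see the version 2.0 change log below), so you may need to configure logging (e.g. `logging.basicConfig(level=logging.INFO)`) to see the messages.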
## Change Log
### Version 3.2
- Added support for layer-wise scaling to DAdaptAdam.
### Version 3.0
 - Major improvements to DAdaptAdam, particularly improving performance on Transformer models. This variant may behave differently in practice. The old version is available under `experimental/d_adapt_adam_preprint` if you wish to continue using it.
- The IP variant is now the main variant of the method.
 - Added Lion. This is highly experimental; feedback on its performance is welcome.
### Version 2.0
 - Added Adan; it should still be considered experimental.
 - Added support for PyTorch's Fully Sharded Data Parallel.
 - Improved support for edge cases such as a learning rate of zero.
 - Improved logging: uses Python's logging module rather than print statements.
# Experimental results

Result figures are shown in the [GitHub repository](https://github.com/facebookresearch/dadaptation).
# License
See the [License file](/LICENSE).