torch-optimizer

Name	torch-optimizer
Version	0.3.0
home_page	https://github.com/jettify/pytorch-optimizer
Summary	pytorch-optimizer
upload_time	2021-10-31 03:00:22
author	Nikolay Novik
requires_python	>=3.6.0
license	Apache 2
keywords	torch-optimizer pytorch accsgd adabound adamod diffgrad lamb lookahead madgrad novograd pid qhadam qhm radam sgdw yogi ranger

torch-optimizer
===============
.. image:: https://github.com/jettify/pytorch-optimizer/workflows/CI/badge.svg
   :target: https://github.com/jettify/pytorch-optimizer/actions?query=workflow%3ACI
   :alt: GitHub Actions status for master branch
.. image:: https://codecov.io/gh/jettify/pytorch-optimizer/branch/master/graph/badge.svg
    :target: https://codecov.io/gh/jettify/pytorch-optimizer
.. image:: https://img.shields.io/pypi/pyversions/torch-optimizer.svg
    :target: https://pypi.org/project/torch-optimizer
.. image:: https://readthedocs.org/projects/pytorch-optimizer/badge/?version=latest
    :target: https://pytorch-optimizer.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
.. image:: https://img.shields.io/pypi/v/torch-optimizer.svg
    :target: https://pypi.python.org/pypi/torch-optimizer
.. image:: https://static.deepsource.io/deepsource-badge-light-mini.svg
    :target: https://deepsource.io/gh/jettify/pytorch-optimizer/?ref=repository-badge


**torch-optimizer** is a collection of optimizers for PyTorch_ that is compatible
with the optim_ module.


Simple example
--------------

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.DiffGrad(model.parameters(), lr=0.001)
    optimizer.step()


Installation
------------
Installation is simple; just run::

    $ pip install torch_optimizer


Documentation
-------------
https://pytorch-optimizer.rtfd.io


Citation
--------
Please cite the original authors of the optimization algorithms. If you like this
package, you can cite it as::

    @software{Novik_torchoptimizers,
        title        = {{torch-optimizer -- collection of optimization algorithms for PyTorch.}},
        author       = {Novik, Mykola},
        year         = 2020,
        month        = 1,
        version      = {1.0.1}
    }

Or use the GitHub "Cite this repository" button.


Supported Optimizers
====================

+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `A2GradExp`_  | https://arxiv.org/abs/1810.00553                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `A2GradInc`_  | https://arxiv.org/abs/1810.00553                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `A2GradUni`_  | https://arxiv.org/abs/1810.00553                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AccSGD`_     | https://arxiv.org/abs/1803.05591                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AdaBelief`_  | https://arxiv.org/abs/2010.07468                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AdaBound`_   | https://arxiv.org/abs/1902.09843                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AdaMod`_     | https://arxiv.org/abs/1910.12249                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Adafactor`_  | https://arxiv.org/abs/1804.04235                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Adahessian`_ | https://arxiv.org/abs/2006.00719                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AdamP`_      | https://arxiv.org/abs/2006.08217                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `AggMo`_      | https://arxiv.org/abs/1804.00325                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Apollo`_     | https://arxiv.org/abs/2009.13586                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `DiffGrad`_   | https://arxiv.org/abs/1909.11015                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Lamb`_       | https://arxiv.org/abs/1904.00962                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Lookahead`_  | https://arxiv.org/abs/1907.08610                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `MADGRAD`_    | https://arxiv.org/abs/2101.11075                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `NovoGrad`_   | https://arxiv.org/abs/1905.11286                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `PID`_        | https://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf                                                                        |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `QHAdam`_     | https://arxiv.org/abs/1810.06801                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `QHM`_        | https://arxiv.org/abs/1810.06801                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `RAdam`_      | https://arxiv.org/abs/1908.03265                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Ranger`_     | https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `RangerQH`_   | https://arxiv.org/abs/1810.06801                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `RangerVA`_   | https://arxiv.org/abs/1908.00700v2                                                                                                   |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `SGDP`_       | https://arxiv.org/abs/2006.08217                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `SGDW`_       | https://arxiv.org/abs/1608.03983                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `SWATS`_      | https://arxiv.org/abs/1712.07628                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Shampoo`_    | https://arxiv.org/abs/1802.09568                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+
|               |                                                                                                                                      |
| `Yogi`_       | https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization                                                        |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------+


Visualizations
--------------
Visualizations help us see how different algorithms deal with simple
situations such as saddle points, local minima, and valleys, and may provide
interesting insights into the inner workings of an algorithm. The Rosenbrock_ and
Rastrigin_ benchmark_ functions were selected because:

* Rosenbrock_ (also known as the banana function) is a non-convex function that
  has one global minimum at `(1.0, 1.0)`. The global minimum is inside a long,
  narrow, parabolic-shaped flat valley. Finding the valley is trivial.
  Converging to the global minimum, however, is difficult. Optimization
  algorithms might pay a lot of attention to one coordinate and have
  trouble following the valley, which is relatively flat.

  .. image::  https://upload.wikimedia.org/wikipedia/commons/3/32/Rosenbrock_function.svg

* The Rastrigin_ function is non-convex and has one global minimum at `(0.0, 0.0)`.
  Finding the minimum of this function is a fairly difficult problem due to
  its large search space and its large number of local minima.

  .. image::  https://upload.wikimedia.org/wikipedia/commons/8/8b/Rastrigin_function.png
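
For reference, both benchmark functions described above can be written in a
few lines of plain Python. This is a minimal sketch that assumes the standard
two-dimensional definitions (Rosenbrock with ``a = 1`` and ``b = 100``,
Rastrigin with ``A = 10``). The function names are illustrative and are not
part of the torch-optimizer API, and the visualization script may define the
functions differently.

.. code:: python

    import math


    def rosenbrock(x, y, a=1.0, b=100.0):
        """Rosenbrock function; for a=1 the global minimum is f(1, 1) = 0."""
        return (a - x) ** 2 + b * (y - x ** 2) ** 2


    def rastrigin(x, y, A=10.0):
        """Two-dimensional Rastrigin function; global minimum is f(0, 0) = 0."""
        return (
            2 * A
            + (x ** 2 - A * math.cos(2 * math.pi * x))
            + (y ** 2 - A * math.cos(2 * math.pi * y))
        )


    print(rosenbrock(1.0, 1.0))  # value at the global minimum
    print(rastrigin(0.0, 0.0))   # value at the global minimum
    print(rastrigin(1.0, 1.0))   # value near one of the many local minima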

Each optimizer performs `501` optimization steps. The learning rate is the best
one found by a hyperparameter search algorithm; the rest of the tuning
parameters are left at their defaults. It is very easy to extend the script and
tune other optimizer parameters.


.. code::

    python examples/viz_optimizers.py
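
The procedure described above (a fixed budget of `501` steps, with the learning
rate chosen by a search) can be sketched without any dependencies. The sketch
below is not the code of ``examples/viz_optimizers.py``: it assumes plain
gradient descent in place of the library optimizers, a small fixed grid in
place of the hyperparameter search algorithm, and an arbitrary starting point
of ``(-2.0, 2.0)``. All function names are hypothetical.

.. code:: python

    import math


    def rosenbrock(x, y):
        # Multiplication (not **) lets a diverging run overflow to inf/nan
        # instead of raising OverflowError.
        u = 1.0 - x
        v = y - x * x
        return u * u + 100.0 * v * v


    def rosenbrock_grad(x, y):
        v = y - x * x
        return -2.0 * (1.0 - x) - 400.0 * x * v, 200.0 * v


    def final_loss(lr, steps=501, start=(-2.0, 2.0)):
        """Run plain gradient descent for a fixed number of steps."""
        x, y = start
        for _ in range(steps):
            dx, dy = rosenbrock_grad(x, y)
            x, y = x - lr * dx, y - lr * dy
        loss = rosenbrock(x, y)
        # Treat diverged runs (inf or nan) as the worst possible result.
        return loss if math.isfinite(loss) else math.inf


    def search_learning_rate(candidates):
        """Return (best_lr, best_loss) over a fixed grid of learning rates."""
        losses = {lr: final_loss(lr) for lr in candidates}
        best_lr = min(losses, key=losses.get)
        return best_lr, losses[best_lr]


    best_lr, best_loss = search_learning_rate([1e-5, 1e-4, 1e-3, 1e-2])
    print(f"best lr: {best_lr}, final loss: {best_loss:.4f}")

Comparing every candidate under the same step budget is what makes the final
losses comparable; a learning rate that would win with more steps can still
lose here.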


Warning
-------
Do not pick an optimizer based on visualizations. Optimization approaches
have unique properties and may be tailored for different purposes or may
require an explicit learning rate schedule, etc. The best way to find out is to
try one on your particular problem and see if it improves scores.

If you do not know which optimizer to use, start with the built-in SGD/Adam.
Once the training logic is ready and baseline scores are established, swap the
optimizer and see if there is any improvement.


A2GradExp
---------

+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradExp.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradExp.png  |
+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.A2GradExp(
        model.parameters(),
        kappa=1000.0,
        beta=10.0,
        lips=10.0,
        rho=0.5,
    )
    optimizer.step()


**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]

**Reference Code**: https://github.com/severilov/A2Grad_optimizer


A2GradInc
---------

+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradInc.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradInc.png  |
+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.A2GradInc(
        model.parameters(),
        kappa=1000.0,
        beta=10.0,
        lips=10.0,
    )
    optimizer.step()


**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]

**Reference Code**: https://github.com/severilov/A2Grad_optimizer


A2GradUni
---------

+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradUni.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradUni.png  |
+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.A2GradUni(
        model.parameters(),
        kappa=1000.0,
        beta=10.0,
        lips=10.0,
    )
    optimizer.step()


**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]

**Reference Code**: https://github.com/severilov/A2Grad_optimizer


AccSGD
------

+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AccSGD.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AccSGD.png  |
+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AccSGD(
        model.parameters(),
        lr=1e-3,
        kappa=1000.0,
        xi=10.0,
        small_const=0.7,
        weight_decay=0
    )
    optimizer.step()


**Paper**: *On the insufficiency of existing momentum schemes for Stochastic Optimization* (2019) [https://arxiv.org/abs/1803.05591]

**Reference Code**: https://github.com/rahulkidambi/AccSGD


AdaBelief
---------

+-------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaBelief.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaBelief.png |
+-------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AdaBelief(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-3,
        weight_decay=0,
        amsgrad=False,
        weight_decouple=False,
        fixed_decay=False,
        rectify=False,
    )
    optimizer.step()


**Paper**: *AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients* (2020) [https://arxiv.org/abs/2010.07468]

**Reference Code**: https://github.com/juntang-zhuang/Adabelief-Optimizer


AdaBound
--------

+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaBound.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaBound.png |
+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AdaBound(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        final_lr=0.1,
        gamma=1e-3,
        eps=1e-8,
        weight_decay=0,
        amsbound=False,
    )
    optimizer.step()


**Paper**: *Adaptive Gradient Methods with Dynamic Bound of Learning Rate* (2019) [https://arxiv.org/abs/1902.09843]

**Reference Code**: https://github.com/Luolc/AdaBound

AdaMod
------
The AdaMod method restricts the adaptive learning rates with adaptive and momental
upper bounds. The dynamic learning rate bounds are based on the exponential
moving averages of the adaptive learning rates themselves, which smooth out
unexpectedly large learning rates and stabilize the training of deep neural networks.

+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaMod.png    |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaMod.png   |
+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AdaMod(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        beta3=0.999,
        eps=1e-8,
        weight_decay=0,
    )
    optimizer.step()

**Paper**: *An Adaptive and Momental Bound Method for Stochastic Learning.* (2019) [https://arxiv.org/abs/1910.12249]

**Reference Code**: https://github.com/lancopku/AdaMod
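
The bounding rule described above can be illustrated for a single scalar
parameter. This is a simplified sketch based on the description in the paper,
not the library's implementation: the function name and the ``state``
dictionary are hypothetical, and details such as weight decay are omitted.

.. code:: python

    import math


    def adamod_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                    beta3=0.999, eps=1e-8):
        """One AdaMod-style update for a single scalar parameter."""
        beta1, beta2 = betas
        state["step"] += 1
        t = state["step"]

        # Adam-style first and second moment estimates with bias correction.
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
        m_hat = state["m"] / (1 - beta1 ** t)
        v_hat = state["v"] / (1 - beta2 ** t)

        # Adaptive learning rate, and its exponential moving average.
        eta = lr / (math.sqrt(v_hat) + eps)
        state["s"] = beta3 * state["s"] + (1 - beta3) * eta

        # The moving average acts as an upper bound on the learning rate.
        bounded_eta = min(eta, state["s"])
        return param - bounded_eta * m_hat, eta, bounded_eta


    state = {"step": 0, "m": 0.0, "v": 0.0, "s": 0.0}
    x = 1.0
    for _ in range(3):
        grad = 2.0 * x  # gradient of f(x) = x ** 2
        x, eta, bounded_eta = adamod_step(x, grad, state)
        print(f"x={x:.6f} eta={eta:.6f} bounded_eta={bounded_eta:.6f}")

Because the moving average starts at zero in this sketch, the bound is very
tight during the first steps and relaxes as the average catches up.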


Adafactor
---------
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adafactor.png |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adafactor.png |
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Adafactor(
        model.parameters(),
        lr=1e-3,
        eps2=(1e-30, 1e-3),
        clip_threshold=1.0,
        decay_rate=-0.8,
        beta1=None,
        weight_decay=0.0,
        scale_parameter=True,
        relative_step=True,
        warmup_init=False,
    )
    optimizer.step()

**Paper**: *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.* (2018) [https://arxiv.org/abs/1804.04235]

**Reference Code**: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py


Adahessian
----------
+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adahessian.png |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adahessian.png  |
+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Adahessian(
        model.parameters(),
        lr=1.0,
        betas=(0.9, 0.999),
        eps=1e-4,
        weight_decay=0.0,
        hessian_power=1.0,
    )
    # create_graph=True is necessary for the Hessian calculation
    loss_fn(model(input), target).backward(create_graph=True)
    optimizer.step()


**Paper**: *ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning* (2020) [https://arxiv.org/abs/2006.00719]

**Reference Code**: https://github.com/amirgholami/adahessian


AdamP
------
AdamP proposes a simple and effective solution: at each iteration of the Adam optimizer
applied to scale-invariant weights (e.g., Conv weights preceding a BN layer), AdamP
removes the radial component (i.e., the component parallel to the weight vector) from
the update vector. Intuitively, this operation prevents unnecessary updates along the
radial direction that only increase the weight norm without contributing to loss
minimization.

+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdamP.png     |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdamP.png    |
+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AdamP(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0,
        delta=0.1,
        wd_ratio=0.1,
    )
    optimizer.step()

**Paper**: *Slowing Down the Weight Norm Increase in Momentum-based Optimizers.* (2020) [https://arxiv.org/abs/2006.08217]

**Reference Code**: https://github.com/clovaai/AdamP
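
Removing the radial component, as described above, amounts to projecting the
update onto the subspace orthogonal to the weight vector. The following is a
minimal sketch of that projection alone, using plain Python lists. The function
name is hypothetical, and the sketch leaves out the rest of the method, such as
deciding when a weight counts as scale-invariant and how weight decay is
adjusted (which the ``delta`` and ``wd_ratio`` arguments presumably relate to).

.. code:: python

    def remove_radial_component(weight, update, eps=1e-8):
        """Project `update` onto the subspace orthogonal to `weight`."""
        dot = sum(w * u for w, u in zip(weight, update))
        norm_sq = sum(w * w for w in weight)
        scale = dot / (norm_sq + eps)
        return [u - scale * w for w, u in zip(weight, update)]


    weight = [3.0, 4.0]
    update = [1.0, 2.0]
    tangential = remove_radial_component(weight, update)

    # The projected update is (numerically) orthogonal to the weight vector,
    # so a small step along it leaves the weight norm almost unchanged.
    print(sum(w * t for w, t in zip(weight, tangential)))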


AggMo
-----

+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AggMo.png     |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AggMo.png    |
+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.AggMo(
        model.parameters(),
        lr=1e-3,
        betas=(0.0, 0.9, 0.99),
        weight_decay=0,
    )
    optimizer.step()

**Paper**: *Aggregated Momentum: Stability Through Passive Damping.* (2019) [https://arxiv.org/abs/1804.00325]

**Reference Code**: https://github.com/AtheMathmo/AggMo


Apollo
------

+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Apollo.png    |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Apollo.png   |
+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Apollo(
        model.parameters(),
        lr=1e-2,
        beta=0.9,
        eps=1e-4,
        warmup=0,
        init_lr=0.01,
        weight_decay=0,
    )
    optimizer.step()

**Paper**: *Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization.* (2020) [https://arxiv.org/abs/2009.13586]

**Reference Code**: https://github.com/XuezheMax/apollo


DiffGrad
--------
This optimizer adjusts the step size of each parameter based on the
difference between the current gradient and the immediately preceding one:
parameters whose gradient changes quickly get a larger step size, and
parameters whose gradient changes slowly get a smaller one.

+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_DiffGrad.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_DiffGrad.png  |
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.DiffGrad(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0,
    )
    optimizer.step()


**Paper**: *diffGrad: An Optimization Method for Convolutional Neural Networks.* (2019) [https://arxiv.org/abs/1909.11015]

**Reference Code**: https://github.com/shivram1987/diffGrad
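
The step-size adjustment described above can be sketched for a single scalar
parameter in plain Python. This is a minimal illustration that assumes the
update rule given in the diffGrad paper (an Adam-style step scaled by a
sigmoid of the absolute gradient change); it is not the library's
implementation, and the ``diffgrad_step`` function and ``state`` dictionary
are hypothetical names introduced only for this sketch.

.. code:: python

    import math


    def diffgrad_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        """Apply one diffGrad-style update to a scalar parameter (illustrative only)."""
        beta1, beta2 = betas
        state["step"] += 1
        # Adam-style running averages of the gradient and the squared gradient.
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
        # Friction coefficient: sigmoid of the absolute gradient change.
        # It is 0.5 when the gradient is unchanged and approaches 1 for large changes.
        friction = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"] - grad)))
        state["prev_grad"] = grad
        # Bias-corrected moments, as in Adam.
        m_hat = state["m"] / (1 - beta1 ** state["step"])
        v_hat = state["v"] / (1 - beta2 ** state["step"])
        new_param = param - lr * friction * m_hat / (math.sqrt(v_hat) + eps)
        return new_param, friction


    state = {"step": 0, "m": 0.0, "v": 0.0, "prev_grad": 0.0}
    x = 1.0
    x, friction_changing = diffgrad_step(x, 2.0, state)  # gradient jumped from 0.0 to 2.0
    x, friction_flat = diffgrad_step(x, 2.0, state)      # gradient did not change

In this sketch the second call sees an unchanged gradient, so its friction
coefficient is smaller than in the first call and the step is damped
accordingly.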

Lamb
----

+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Lamb.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Lamb.png  |
+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Lamb(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0,
    )
    optimizer.step()


**Paper**: *Large Batch Optimization for Deep Learning: Training BERT in 76 minutes* (2019) [https://arxiv.org/abs/1904.00962]

**Reference Code**: https://github.com/cybertronai/pytorch-lamb

Lookahead
---------

+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_LookaheadYogi.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_LookaheadYogi.png  |
+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    # base optimizer, any other optimizer can be used like Adam or DiffGrad
    yogi = optim.Yogi(
        model.parameters(),
        lr=1e-2,
        betas=(0.9, 0.999),
        eps=1e-3,
        initial_accumulator=1e-6,
        weight_decay=0,
    )

    optimizer = optim.Lookahead(yogi, k=5, alpha=0.5)
    optimizer.step()


**Paper**: *Lookahead Optimizer: k steps forward, 1 step back* (2019) [https://arxiv.org/abs/1907.08610]

**Reference Code**: https://github.com/alphadl/lookahead.pytorch
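
The roles of ``k`` and ``alpha`` in the example above can be sketched in plain
Python. This is a minimal illustration of the "k steps forward, 1 step back"
pattern from the paper's title, with plain gradient descent standing in for
the base optimizer; it is not the library's implementation, and
``lookahead_sgd`` and ``grad_fn`` are hypothetical names used only here.

.. code:: python

    def lookahead_sgd(grad_fn, x0, lr=0.1, k=5, alpha=0.5, steps=20):
        """Minimize a scalar function with gradient descent inside a Lookahead-style loop."""
        slow = x0  # slow weights, updated once every k inner steps
        fast = x0  # fast weights, updated by the base optimizer at every step
        for step in range(1, steps + 1):
            # One step of the base optimizer (plain gradient descent here).
            fast = fast - lr * grad_fn(fast)
            if step % k == 0:
                # Move the slow weights part of the way toward the fast weights,
                # then restart the fast weights from the new slow weights.
                slow = slow + alpha * (fast - slow)
                fast = slow
        return slow


    # f(x) = x**2 has gradient 2 * x and its minimum at x = 0.
    result = lookahead_sgd(lambda x: 2.0 * x, x0=1.0)

With ``alpha=1.0`` the slow weights would simply follow the fast weights;
smaller values keep the slow weights closer to their previous position.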


MADGRAD
-------

+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_MADGRAD.png        |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_MADGRAD.png        |
+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.MADGRAD(
        model.parameters(),
        lr=1e-2,
        momentum=0.9,
        weight_decay=0,
        eps=1e-6,
    )
    optimizer.step()


**Paper**: *Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization* (2021) [https://arxiv.org/abs/2101.11075]

**Reference Code**: https://github.com/facebookresearch/madgrad


NovoGrad
--------

+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_NovoGrad.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_NovoGrad.png  |
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.NovoGrad(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0,
        grad_averaging=False,
        amsgrad=False,
    )
    optimizer.step()


**Paper**: *Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks* (2019) [https://arxiv.org/abs/1905.11286]

**Reference Code**: https://github.com/NVIDIA/DeepLearningExamples/


PID
---

+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_PID.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_PID.png  |
+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.PID(
        model.parameters(),
        lr=1e-3,
        momentum=0,
        dampening=0,
        weight_decay=1e-2,
        integral=5.0,
        derivative=10.0,
    )
    optimizer.step()


**Paper**: *A PID Controller Approach for Stochastic Optimization of Deep Networks* (2018) [http://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf]

**Reference Code**: https://github.com/tensorboy/PIDOptimizer


QHAdam
------

+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_QHAdam.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_QHAdam.png  |
+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.QHAdam(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        nus=(1.0, 1.0),
        weight_decay=0,
        decouple_weight_decay=False,
        eps=1e-8,
    )
    optimizer.step()


**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2019) [https://arxiv.org/abs/1810.06801]

**Reference Code**: https://github.com/facebookresearch/qhoptim


QHM
---

+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_QHM.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_QHM.png  |
+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.QHM(
        model.parameters(),
        lr=1e-3,
        momentum=0,
        nu=0.7,
        weight_decay=1e-2,
        weight_decay_type='grad',
    )
    optimizer.step()


**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2019) [https://arxiv.org/abs/1810.06801]

**Reference Code**: https://github.com/facebookresearch/qhoptim


RAdam
-----

+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RAdam.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RAdam.png  |
+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.RAdam(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=0,
    )
    optimizer.step()


**Paper**: *On the Variance of the Adaptive Learning Rate and Beyond* (2019) [https://arxiv.org/abs/1908.03265]

**Reference Code**: https://github.com/LiyuanLucasLiu/RAdam


Ranger
------

+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Ranger.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Ranger.png  |
+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Ranger(
        model.parameters(),
        lr=1e-3,
        alpha=0.5,
        k=6,
        N_sma_threshhold=5,
        betas=(.95, 0.999),
        eps=1e-5,
        weight_decay=0
    )
    optimizer.step()


**Paper**: *New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both* (2019) [https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d]

**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer


RangerQH
--------

+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RangerQH.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RangerQH.png  |
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.RangerQH(
        model.parameters(),
        lr=1e-3,
        betas=(0.9, 0.999),
        nus=(.7, 1.0),
        weight_decay=0.0,
        k=6,
        alpha=.5,
        decouple_weight_decay=False,
        eps=1e-8,
    )
    optimizer.step()


**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2019) [https://arxiv.org/abs/1810.06801]

**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer


RangerVA
--------

+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RangerVA.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RangerVA.png  |
+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.RangerVA(
        model.parameters(),
        lr=1e-3,
        alpha=0.5,
        k=6,
        n_sma_threshhold=5,
        betas=(.95, 0.999),
        eps=1e-5,
        weight_decay=0,
        amsgrad=True,
        transformer='softplus',
        smooth=50,
        grad_transformer='square'
    )
    optimizer.step()


**Paper**: *Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM* (2019) [https://arxiv.org/abs/1908.00700v2]

**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer


SGDP
----

+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGDP.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGDP.png  |
+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.SGDP(
        model.parameters(),
        lr=1e-3,
        momentum=0,
        dampening=0,
        weight_decay=1e-2,
        nesterov=False,
        delta=0.1,
        wd_ratio=0.1,
    )
    optimizer.step()


**Paper**: *Slowing Down the Weight Norm Increase in Momentum-based Optimizers.* (2020) [https://arxiv.org/abs/2006.08217]

**Reference Code**: https://github.com/clovaai/AdamP


SGDW
----

+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGDW.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGDW.png  |
+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.SGDW(
        model.parameters(),
        lr=1e-3,
        momentum=0,
        dampening=0,
        weight_decay=1e-2,
        nesterov=False,
    )
    optimizer.step()


**Paper**: *SGDR: Stochastic Gradient Descent with Warm Restarts* (2017) [https://arxiv.org/abs/1608.03983]

**Reference Code**: https://github.com/pytorch/pytorch/pull/22466


SWATS
-----

+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SWATS.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SWATS.png  |
+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.SWATS(
        model.parameters(),
        lr=1e-1,
        betas=(0.9, 0.999),
        eps=1e-3,
        weight_decay=0.0,
        amsgrad=False,
        nesterov=False,
    )
    optimizer.step()


**Paper**: *Improving Generalization Performance by Switching from Adam to SGD* (2017) [https://arxiv.org/abs/1712.07628]

**Reference Code**: https://github.com/Mrpatekful/swats


Shampoo
-------

+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Shampoo.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Shampoo.png  |
+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Shampoo(
        model.parameters(),
        lr=1e-1,
        momentum=0.0,
        weight_decay=0.0,
        epsilon=1e-4,
        update_freq=1,
    )
    optimizer.step()


**Paper**: *Shampoo: Preconditioned Stochastic Tensor Optimization* (2018) [https://arxiv.org/abs/1802.09568]

**Reference Code**: https://github.com/moskomule/shampoo.pytorch


Yogi
----

Yogi is an optimization algorithm based on Adam with more fine-grained control
of the effective learning rate, and it has convergence guarantees similar to
those of Adam.

+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Yogi.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Yogi.png  |
+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+

.. code:: python

    import torch_optimizer as optim

    # model = ...
    optimizer = optim.Yogi(
        model.parameters(),
        lr=1e-2,
        betas=(0.9, 0.999),
        eps=1e-3,
        initial_accumulator=1e-6,
        weight_decay=0,
    )
    optimizer.step()


**Paper**: *Adaptive Methods for Nonconvex Optimization* (2018) [https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization]

**Reference Code**: https://github.com/4rtemi5/Yogi-Optimizer_Keras
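
The difference in learning rate control comes from how the second-moment
estimate is updated. The sketch below compares the two rules for a single
scalar, assuming the update equations from the Yogi paper; it is an
illustration rather than the library's implementation, and both function
names are hypothetical.

.. code:: python

    def adam_second_moment(v, grad, beta2=0.999):
        """Adam: exponential moving average of the squared gradient."""
        return beta2 * v + (1 - beta2) * grad * grad


    def yogi_second_moment(v, grad, beta2=0.999):
        """Yogi: additive update whose size depends only on the squared gradient."""
        g2 = grad * grad
        sign = (v > g2) - (v < g2)  # sign(v - g2) as -1, 0, or 1
        return v - (1 - beta2) * sign * g2


    # A large accumulator followed by a small gradient.
    v_adam = adam_second_moment(1.0, 0.1, beta2=0.9)
    v_yogi = yogi_second_moment(1.0, 0.1, beta2=0.9)

Here Adam's estimate shrinks in proportion to its own size, while Yogi's
shrinks only by an amount proportional to the squared gradient, so the
effective learning rate grows more gradually under Yogi.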


Adam (PyTorch built-in)
-----------------------

+---------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adam.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adam.png  |
+---------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+

SGD (PyTorch built-in)
----------------------

+--------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+
| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGD.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGD.png  |
+--------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+

.. _Python: https://www.python.org
.. _PyTorch: https://github.com/pytorch/pytorch
.. _Rastrigin: https://en.wikipedia.org/wiki/Rastrigin_function
.. _Rosenbrock: https://en.wikipedia.org/wiki/Rosenbrock_function
.. _benchmark: https://en.wikipedia.org/wiki/Test_functions_for_optimization
.. _optim: https://pytorch.org/docs/stable/optim.html

Changes
-------

0.3.0 (2021-10-30)
------------------
* Revert the removal of the RAdam optimizer.

0.2.0 (2021-10-25)
------------------
* Drop RAdam optimizer since it is included in pytorch.
* Do not include tests as installable package.
* Preserve memory layout where possible.
* Add MADGRAD optimizer.

0.1.0 (2021-01-01)
------------------
* Initial release.
* Added support for A2GradExp, A2GradInc, A2GradUni, AccSGD, AdaBelief,
  AdaBound, AdaMod, Adafactor, Adahessian, AdamP, AggMo, Apollo,
  DiffGrad, Lamb, Lookahead, NovoGrad, PID, QHAdam, QHM, RAdam, Ranger,
  RangerQH, RangerVA, SGDP, SGDW, SWATS, Shampoo, Yogi.

|\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n|               |                                                                                                                                      |\n| `SGDP`_       | https://arxiv.org/abs/2006.08217                                                                                                     |\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n|               |                                                                                                                                      |\n| `SGDW`_       | https://arxiv.org/abs/1608.03983                                                                                                     |\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n|               |                                                                                                                                      |\n| `SWATS`_      | https://arxiv.org/abs/1712.07628                                                                                                     |\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n|               |                                                                                                                                      |\n| `Shampoo`_    | https://arxiv.org/abs/1802.09568                                                                                                     
|\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n|               |                                                                                                                                      |\n| `Yogi`_       | https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization                                                        |\n+---------------+--------------------------------------------------------------------------------------------------------------------------------------+\n\n\nVisualizations\n--------------\nVisualizations help us see how different algorithms deal with simple\nsituations such as saddle points, local minima, and valleys, and may provide\ninteresting insights into the inner workings of an algorithm. The Rosenbrock_ and\nRastrigin_ benchmark_ functions were selected because:\n\n* Rosenbrock_ (also known as the banana function) is a non-convex function that\n  has one global minimum at `(1.0, 1.0)`. The global minimum lies inside a long,\n  narrow, parabolic-shaped flat valley. Finding the valley is trivial;\n  converging to the global minimum, however, is difficult. Optimization\n  algorithms might pay a lot of attention to one coordinate and have\n  trouble following the valley, which is relatively flat.\n\n  .. image::  https://upload.wikimedia.org/wikipedia/commons/3/32/Rosenbrock_function.svg\n\n* The Rastrigin_ function is non-convex and has one global minimum at `(0.0, 0.0)`.\n  Finding the minimum of this function is a fairly difficult problem due to\n  its large search space and its large number of local minima.\n\n  .. image::  https://upload.wikimedia.org/wikipedia/commons/8/8b/Rastrigin_function.png\n\nEach optimizer performs `501` optimization steps. The learning rate is the best one found\nby a hyperparameter search algorithm; the rest of the tuning parameters are left at their defaults. 
It\nis easy to extend the script and tune other optimizer parameters.\n\n\n.. code::\n\n    python examples/viz_optimizers.py\n\n\nWarning\n-------\nDo not pick an optimizer based on visualizations alone. Optimization approaches\nhave unique properties and may be tailored for different purposes or may\nrequire an explicit learning rate schedule. The best way to find out is to try one\non your particular problem and see if it improves scores.\n\nIf you do not know which optimizer to use, start with the built-in SGD/Adam. Once\nthe training logic is ready and baseline scores are established, swap the optimizer and\nsee if there is any improvement.\n\n\nA2GradExp\n---------\n\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradExp.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradExp.png  |\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.A2GradExp(\n        model.parameters(),\n        kappa=1000.0,\n        beta=10.0,\n        lips=10.0,\n        rho=0.5,\n    )\n    optimizer.step()\n\n\n**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]\n\n**Reference Code**: https://github.com/severilov/A2Grad_optimizer\n\n\nA2GradInc\n---------\n\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradInc.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradInc.png  |\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.A2GradInc(\n        model.parameters(),\n        kappa=1000.0,\n        beta=10.0,\n        lips=10.0,\n    )\n    optimizer.step()\n\n\n**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]\n\n**Reference Code**: https://github.com/severilov/A2Grad_optimizer\n\n\nA2GradUni\n---------\n\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_A2GradUni.png   |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_A2GradUni.png  |\n+--------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.A2GradUni(\n        model.parameters(),\n        kappa=1000.0,\n        beta=10.0,\n        lips=10.0,\n    )\n    optimizer.step()\n\n\n**Paper**: *Optimal Adaptive and Accelerated Stochastic Gradient Descent* (2018) [https://arxiv.org/abs/1810.00553]\n\n**Reference Code**: https://github.com/severilov/A2Grad_optimizer\n\n\nAccSGD\n------\n\n+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AccSGD.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AccSGD.png  |\n+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AccSGD(\n        model.parameters(),\n        lr=1e-3,\n        kappa=1000.0,\n        xi=10.0,\n        small_const=0.7,\n        weight_decay=0\n    )\n    optimizer.step()\n\n\n**Paper**: *On the insufficiency of existing momentum schemes for Stochastic Optimization* (2019) [https://arxiv.org/abs/1803.05591]\n\n**Reference Code**: https://github.com/rahulkidambi/AccSGD\n\n\nAdaBelief\n---------\n\n+-------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaBelief.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaBelief.png |\n+-------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AdaBelief(\n        model.parameters(),\n        lr=1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-3,\n        weight_decay=0,\n        amsgrad=False,\n        weight_decouple=False,\n        fixed_decay=False,\n        rectify=False,\n    )\n    optimizer.step()\n\n\n**Paper**: *AdaBelief Optimizer, adapting stepsizes by the belief in observed gradients* (2020) [https://arxiv.org/abs/2010.07468]\n\n**Reference Code**: https://github.com/juntang-zhuang/Adabelief-Optimizer\n\n\nAdaBound\n--------\n\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaBound.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaBound.png |\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AdaBound(\n        model.parameters(),\n        lr=1e-3,\n        betas=(0.9, 0.999),\n        final_lr=0.1,\n        gamma=1e-3,\n        eps=1e-8,\n        weight_decay=0,\n        amsbound=False,\n    )\n    optimizer.step()\n\n\n**Paper**: *Adaptive Gradient Methods with Dynamic Bound of Learning Rate* (2019) [https://arxiv.org/abs/1902.09843]\n\n**Reference Code**: https://github.com/Luolc/AdaBound\n\nAdaMod\n------\nThe AdaMod method restricts the adaptive learning rates with adaptive and momental\nupper bounds. 
The dynamic learning rate bounds are based on the exponential\nmoving averages of the adaptive learning rates themselves, which smooth out\nunexpected large learning rates and stabilize the training of deep neural networks.\n\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdaMod.png    |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdaMod.png   |\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AdaMod(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        beta3=0.999,\n        eps=1e-8,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n**Paper**: *An Adaptive and Momental Bound Method for Stochastic Learning.* (2019) [https://arxiv.org/abs/1910.12249]\n\n**Reference Code**: https://github.com/lancopku/AdaMod\n\n\nAdafactor\n---------\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adafactor.png |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adafactor.png |\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Adafactor(\n        m.parameters(),\n        lr= 1e-3,\n        eps2= (1e-30, 1e-3),\n        clip_threshold=1.0,\n        decay_rate=-0.8,\n        beta1=None,\n        weight_decay=0.0,\n        scale_parameter=True,\n        relative_step=True,\n        warmup_init=False,\n    )\n    optimizer.step()\n\n**Paper**: *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.* (2018) [https://arxiv.org/abs/1804.04235]\n\n**Reference Code**: https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py\n\n\nAdahessian\n----------\n+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adahessian.png |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adahessian.png  |\n+-------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Adahessian(\n        model.parameters(),\n        lr=1.0,\n        betas=(0.9, 0.999),\n        eps=1e-4,\n        weight_decay=0.0,\n        hessian_power=1.0,\n    )\n    # create_graph=True is necessary for Hessian calculation\n    loss_fn(model(input), target).backward(create_graph=True)\n    optimizer.step()\n\n\n**Paper**: *ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning* (2020) [https://arxiv.org/abs/2006.00719]\n\n**Reference Code**: https://github.com/amirgholami/adahessian\n\n\nAdamP\n-----\nAdamP proposes a simple and effective solution: at each iteration of the Adam optimizer\napplied to scale-invariant weights (e.g., Conv weights preceding a BN layer), AdamP\nremoves the radial component (i.e., parallel to the weight vector) from the update vector.\nIntuitively, this operation prevents the unnecessary update along the radial direction\nthat only increases the weight norm without contributing to the loss minimization.\n\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AdamP.png     |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AdamP.png    |\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AdamP(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-8,\n        weight_decay=0,\n        delta = 0.1,\n        wd_ratio = 0.1\n    )\n    optimizer.step()\n\n**Paper**: *Slowing Down the Weight Norm Increase in Momentum-based Optimizers.* (2020) [https://arxiv.org/abs/2006.08217]\n\n**Reference Code**: https://github.com/clovaai/AdamP\n\n\nAggMo\n-----\n\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_AggMo.png     |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_AggMo.png    |\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.AggMo(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.0, 0.9, 0.99),\n        weight_decay=0,\n    )\n    optimizer.step()\n\n**Paper**: *Aggregated Momentum: Stability Through Passive Damping.* (2019) [https://arxiv.org/abs/1804.00325]\n\n**Reference Code**: https://github.com/AtheMathmo/AggMo\n\n\nApollo\n------\n\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Apollo.png    |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Apollo.png   |\n+------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Apollo(\n        model.parameters(),\n        lr=1e-2,\n        beta=0.9,\n        eps=1e-4,\n        warmup=0,\n        init_lr=0.01,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n**Paper**: *Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization.* (2020) [https://arxiv.org/abs/2009.13586]\n\n**Reference Code**: https://github.com/XuezheMax/apollo\n\n\nDiffGrad\n--------\nDiffGrad is an optimizer based on the difference between the present and the\nimmediate past gradient. The step size is adjusted for each parameter so that\nparameters whose gradients change quickly take larger steps and parameters\nwhose gradients change slowly take smaller steps.\n\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_DiffGrad.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_DiffGrad.png  |\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.DiffGrad(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-8,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n\n**Paper**: *diffGrad: An Optimization Method for Convolutional Neural Networks.* (2019) [https://arxiv.org/abs/1909.11015]\n\n**Reference Code**: https://github.com/shivram1987/diffGrad\n\nLamb\n----\n\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Lamb.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Lamb.png  |\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Lamb(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-8,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n\n**Paper**: *Large Batch Optimization for Deep Learning: Training BERT in 76 minutes* (2019) [https://arxiv.org/abs/1904.00962]\n\n**Reference Code**: https://github.com/cybertronai/pytorch-lamb\n\nLookahead\n---------\n\n+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_LookaheadYogi.png  |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_LookaheadYogi.png  |\n+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    # base optimizer, any other optimizer can be used like Adam or DiffGrad\n    yogi = optim.Yogi(\n        m.parameters(),\n        lr= 1e-2,\n        betas=(0.9, 0.999),\n        eps=1e-3,\n        initial_accumulator=1e-6,\n        weight_decay=0,\n    )\n\n    optimizer = optim.Lookahead(yogi, k=5, alpha=0.5)\n    optimizer.step()\n\n\n**Paper**: *Lookahead Optimizer: k steps forward, 1 step back* (2019) [https://arxiv.org/abs/1907.08610]\n\n**Reference Code**: https://github.com/alphadl/lookahead.pytorch\n\n\nMADGRAD\n---------\n\n+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_MADGRAD.png        |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_MADGRAD.png        |\n+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.MADGRAD(\n        m.parameters(),\n        lr=1e-2,\n        momentum=0.9,\n        weight_decay=0,\n        eps=1e-6,\n    )\n    optimizer.step()\n\n\n**Paper**: *Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization* (2021) [https://arxiv.org/abs/2101.11075]\n\n**Reference Code**: https://github.com/facebookresearch/madgrad\n\n\nNovoGrad\n--------\n\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_NovoGrad.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_NovoGrad.png  |\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.NovoGrad(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-8,\n        weight_decay=0,\n        grad_averaging=False,\n        amsgrad=False,\n    )\n    optimizer.step()\n\n\n**Paper**: *Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks* (2019) [https://arxiv.org/abs/1905.11286]\n\n**Reference Code**: https://github.com/NVIDIA/DeepLearningExamples/\n\n\nPID\n---\n\n+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n| .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_PID.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_PID.png  |\n+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.PID(\n        m.parameters(),\n        lr=1e-3,\n        momentum=0,\n        dampening=0,\n        weight_decay=1e-2,\n        integral=5.0,\n        derivative=10.0,\n    )\n    optimizer.step()\n\n\n**Paper**: *A PID Controller Approach for Stochastic Optimization of Deep Networks* (2018) [http://www4.comp.polyu.edu.hk/~cslzhang/paper/CVPR18_PID.pdf]\n\n**Reference Code**: https://github.com/tensorboy/PIDOptimizer\n\n\nQHAdam\n------\n\n+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_QHAdam.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_QHAdam.png  |\n+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.QHAdam(\n        m.parameters(),\n        lr= 1e-3,\n        betas=(0.9, 0.999),\n        nus=(1.0, 1.0),\n        weight_decay=0,\n        decouple_weight_decay=False,\n        eps=1e-8,\n    )\n    optimizer.step()\n\n\n**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2019) [https://arxiv.org/abs/1810.06801]\n\n**Reference Code**: https://github.com/facebookresearch/qhoptim\n\n\nQHM\n---\n\n+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_QHM.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_QHM.png  |\n+-------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.QHM(\n        m.parameters(),\n        lr=1e-3,\n        momentum=0,\n        nu=0.7,\n        weight_decay=1e-2,\n        weight_decay_type='grad',\n    )\n    optimizer.step()\n\n\n**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2019) [https://arxiv.org/abs/1810.06801]\n\n**Reference Code**: https://github.com/facebookresearch/qhoptim\n\n\nRAdam\n-----\n\n+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RAdam.png  |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RAdam.png  |\n+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.RAdam(\n        model.parameters(),\n        lr=1e-3,\n        betas=(0.9, 0.999),\n        eps=1e-8,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n\n**Paper**: *On the Variance of the Adaptive Learning Rate and Beyond* (2019) [https://arxiv.org/abs/1908.03265]\n\n**Reference Code**: https://github.com/LiyuanLucasLiu/RAdam\n\n\nRanger\n------\n\n+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Ranger.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Ranger.png  |\n+----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Ranger(\n        model.parameters(),\n        lr=1e-3,\n        alpha=0.5,\n        k=6,\n        N_sma_threshhold=5,\n        betas=(.95, 0.999),\n        eps=1e-5,\n        weight_decay=0\n    )\n    optimizer.step()\n\n\n**Paper**: *New Deep Learning Optimizer, Ranger: Synergistic combination of RAdam + LookAhead for the best of both* (2019) [https://medium.com/@lessw/new-deep-learning-optimizer-ranger-synergistic-combination-of-radam-lookahead-for-the-best-of-2dc83f79a48d]\n\n**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer\n\n\nRangerQH\n--------\n\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RangerQH.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RangerQH.png  |\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.RangerQH(\n        model.parameters(),\n        lr=1e-3,\n        betas=(0.9, 0.999),\n        nus=(.7, 1.0),\n        weight_decay=0.0,\n        k=6,\n        alpha=.5,\n        decouple_weight_decay=False,\n        eps=1e-8,\n    )\n    optimizer.step()\n\n\n**Paper**: *Quasi-hyperbolic momentum and Adam for deep learning* (2018) [https://arxiv.org/abs/1810.06801]\n\n**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer\n\n\nRangerVA\n--------\n\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_RangerVA.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_RangerVA.png  |\n+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.RangerVA(\n        model.parameters(),\n        lr=1e-3,\n        alpha=0.5,\n        k=6,\n        n_sma_threshhold=5,\n        betas=(.95, 0.999),\n        eps=1e-5,\n        weight_decay=0,\n        amsgrad=True,\n        transformer='softplus',\n        smooth=50,\n        grad_transformer='square'\n    )\n    optimizer.step()\n\n\n**Paper**: *Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM* (2019) [https://arxiv.org/abs/1908.00700v2]\n\n**Reference Code**: https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer\n\n\nSGDP\n----\n\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGDP.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGDP.png  |\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.SGDP(\n        model.parameters(),\n        lr=1e-3,\n        momentum=0,\n        dampening=0,\n        weight_decay=1e-2,\n        nesterov=False,\n        delta=0.1,\n        wd_ratio=0.1\n    )\n    optimizer.step()\n\n\n**Paper**: *Slowing Down the Weight Norm Increase in Momentum-based Optimizers.* (2020) [https://arxiv.org/abs/2006.08217]\n\n**Reference Code**: https://github.com/clovaai/AdamP\n\n\nSGDW\n----\n\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGDW.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGDW.png  |\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.SGDW(\n        model.parameters(),\n        lr=1e-3,\n        momentum=0,\n        dampening=0,\n        weight_decay=1e-2,\n        nesterov=False,\n    )\n    optimizer.step()\n\n\n**Paper**: *SGDR: Stochastic Gradient Descent with Warm Restarts* (2017) [https://arxiv.org/abs/1608.03983]\n\n**Reference Code**: https://github.com/pytorch/pytorch/pull/22466\n\n\nSWATS\n-----\n\n+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SWATS.png  |  .. 
image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SWATS.png  |\n+---------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+\n\n.. code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.SWATS(\n        model.parameters(),\n        lr=1e-1,\n        betas=(0.9, 0.999),\n        eps=1e-3,\n        weight_decay=0.0,\n        amsgrad=False,\n        nesterov=False,\n    )\n    optimizer.step()\n\n\n**Paper**: *Improving Generalization Performance by Switching from Adam to SGD* (2017) [https://arxiv.org/abs/1712.07628]\n\n**Reference Code**: https://github.com/Mrpatekful/swats\n\n\nShampoo\n-------\n\n+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Shampoo.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Shampoo.png  |\n+-----------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Shampoo(\n        model.parameters(),\n        lr=1e-1,\n        momentum=0.0,\n        weight_decay=0.0,\n        epsilon=1e-4,\n        update_freq=1,\n    )\n    optimizer.step()\n\n\n**Paper**: *Shampoo: Preconditioned Stochastic Tensor Optimization* (2018) [https://arxiv.org/abs/1802.09568]\n\n**Reference Code**: https://github.com/moskomule/shampoo.pytorch\n\n\nYogi\n----\n\nYogi is an optimization algorithm based on ADAM with more fine-grained effective\nlearning rate control, and it has theoretical convergence guarantees similar to those of ADAM.\n\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Yogi.png  |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Yogi.png  |\n+--------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n\n.. 
code:: python\n\n    import torch_optimizer as optim\n\n    # model = ...\n    optimizer = optim.Yogi(\n        model.parameters(),\n        lr=1e-2,\n        betas=(0.9, 0.999),\n        eps=1e-3,\n        initial_accumulator=1e-6,\n        weight_decay=0,\n    )\n    optimizer.step()\n\n\n**Paper**: *Adaptive Methods for Nonconvex Optimization* (2018) [https://papers.nips.cc/paper/8186-adaptive-methods-for-nonconvex-optimization]\n\n**Reference Code**: https://github.com/4rtemi5/Yogi-Optimizer_Keras\n\n\nAdam (PyTorch built-in)\n-----------------------\n\n+---------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_Adam.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_Adam.png  |\n+---------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------+\n\nSGD (PyTorch built-in)\n----------------------\n\n+--------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n| .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rastrigin_SGD.png   |  .. image:: https://raw.githubusercontent.com/jettify/pytorch-optimizer/master/docs/rosenbrock_SGD.png  |\n+--------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------+\n\n.. _Python: https://www.python.org\n.. _PyTorch: https://github.com/pytorch/pytorch\n.. 
_Rastrigin: https://en.wikipedia.org/wiki/Rastrigin_function\n.. _Rosenbrock: https://en.wikipedia.org/wiki/Rosenbrock_function\n.. _benchmark: https://en.wikipedia.org/wiki/Test_functions_for_optimization\n.. _optim: https://pytorch.org/docs/stable/optim.html\n\nChanges\n-------\n\n0.3.0 (2021-10-30)\n------------------\n* Revert the removal of the RAdam optimizer.\n\n0.2.0 (2021-10-25)\n------------------\n* Drop RAdam optimizer since it is included in PyTorch.\n* Do not include tests as an installable package.\n* Preserve memory layout where possible.\n* Add MADGRAD optimizer.\n\n0.1.0 (2021-01-01)\n------------------\n* Initial release.\n* Added support for A2GradExp, A2GradInc, A2GradUni, AccSGD, AdaBelief,\n  AdaBound, AdaMod, Adafactor, Adahessian, AdamP, AggMo, Apollo,\n  DiffGrad, Lamb, Lookahead, NovoGrad, PID, QHAdam, QHM, RAdam, Ranger,\n  RangerQH, RangerVA, SGDP, SGDW, SWATS, Shampoo, Yogi.\n\n",
    "bugtrack_url": null,
    "license": "Apache 2",
    "summary": "pytorch-optimizer",
    "version": "0.3.0",
    "project_urls": {
        "Documentation": "https://pytorch-optimizer.readthedocs.io",
        "Download": "https://pypi.org/project/torch-optimizer/",
        "Homepage": "https://github.com/jettify/pytorch-optimizer",
        "Issues": "https://github.com/jettify/pytorch-optimizer/issues",
        "Website": "https://github.com/jettify/pytorch-optimizer"
    },
    "split_keywords": [
        "torch-optimizer",
        "pytorch",
        "accsgd",
        "adabound",
        "adamod",
        "diffgrad",
        "lamb",
        "lookahead",
        "madgrad",
        "novograd",
        "pid",
        "qhadam",
        "qhm",
        "radam",
        "sgdw",
        "yogi",
        "ranger"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f654bbb1b4c15afc2dac525c8359c340ade685542113394fd4c6564ee3c71da3",
                "md5": "43e2ffdf564e56a782df458c84c586e5",
                "sha256": "7de8e57315e43561cdd0370a1b67303cc8ef1b053f9b5573de629a62390f2af9"
            },
            "downloads": -1,
            "filename": "torch_optimizer-0.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "43e2ffdf564e56a782df458c84c586e5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6.0",
            "size": 61897,
            "upload_time": "2021-10-31T03:00:19",
            "upload_time_iso_8601": "2021-10-31T03:00:19.812935Z",
            "url": "https://files.pythonhosted.org/packages/f6/54/bbb1b4c15afc2dac525c8359c340ade685542113394fd4c6564ee3c71da3/torch_optimizer-0.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1813c4c0a206131e978d8ceaa095ad1e3153d7daf48efad207b6057efe3491a2",
                "md5": "8d1d81b4266c3f77017a33c1bdecda82",
                "sha256": "b2180629df9d6cd7a2aeabe71fa4a872bba938e8e275965092568cd9931b924c"
            },
            "downloads": -1,
            "filename": "torch-optimizer-0.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "8d1d81b4266c3f77017a33c1bdecda82",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6.0",
            "size": 54409,
            "upload_time": "2021-10-31T03:00:22",
            "upload_time_iso_8601": "2021-10-31T03:00:22.084776Z",
            "url": "https://files.pythonhosted.org/packages/18/13/c4c0a206131e978d8ceaa095ad1e3153d7daf48efad207b6057efe3491a2/torch-optimizer-0.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2021-10-31 03:00:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jettify",
    "github_project": "pytorch-optimizer",
    "travis_ci": false,
    "coveralls": true,
    "github_actions": true,
    "lcname": "torch-optimizer"
}

Nikolay Novik