# kron-torch

- Name: kron-torch
- Version: 0.2.3
- Summary: An implementation of the PSGD Kron optimizer in PyTorch.
- Author: Evan Walters, Omead Pooladzandi, Xi-Lin Li
- Homepage: https://github.com/evanatyourservice/kron_torch
- Requires Python: >=3.9
- Keywords: python, machine learning, optimization, pytorch
- Uploaded: 2024-10-29 21:28:56
# PSGD Kron

For original PSGD repo, see [psgd_torch](https://github.com/lixilinx/psgd_torch).

For JAX version, see [psgd_jax](https://github.com/evanatyourservice/psgd_jax).

An implementation of [PSGD optimizers](https://github.com/lixilinx/psgd_torch) in PyTorch. 
PSGD is a second-order optimizer originally created by Xi-Lin Li that uses either a Hessian-based 
or whitening-based (gg^T) preconditioner and Lie groups to improve training convergence, 
generalization, and efficiency. I highly suggest reading the readme of Xi-Lin's PSGD repo, linked 
above, for details on how PSGD works and for experiments using PSGD. There are also 
paper resources listed near the bottom of this readme.

### `kron`:

The most versatile and easy-to-use PSGD optimizer is `kron`, which uses a Kronecker-factored 
preconditioner. It has fewer hyperparameters that need tuning than Adam, and can generally act as a 
drop-in replacement.

## Installation

```bash
pip install kron-torch
```

## Basic Usage (Kron)

By default, Kron anneals the preconditioner update probability from 1.0 to 0.03 over the beginning 
of training, so training will be slightly slower at the start but will speed up by around 4k steps.

For basic usage, use the `Kron` optimizer like any other PyTorch optimizer:

```python
from kron_torch import Kron

# Drop-in replacement for other PyTorch optimizers.
optimizer = Kron(params)

# Standard training step:
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

**Basic hyperparameters:**

TLDR: Learning rate and weight decay act similarly to Adam's, so start with Adam-like settings and 
go from there. There is no `b2` or `epsilon`.

The next three settings control whether a dimension's preconditioner is diagonal or triangular. 
For example, for a layer with shape (256, 128), triangular preconditioners would have shapes 
(256, 256) and (128, 128), and diagonal preconditioners would have shapes (256,) and (128,). 
Depending on how these settings are chosen, `kron` can trade off memory/speed against 
effectiveness. The defaults lead to most preconditioners being triangular, except for 
1-dimensional layers and very large dimensions.

`max_size_triangular`: Any dimension with size above this value will have a diagonal preconditioner.

`min_ndim_triangular`: Any tensor with fewer than this number of dimensions will have all diagonal 
preconditioners. The default is 2, so single-dim layers like bias and scale will use diagonal
preconditioners.

`memory_save_mode`: Can be None, 'one_diag', or 'all_diag'. None is the default and lets all 
preconditioners be triangular. 'one_diag' sets the largest or last dim per layer to diagonal 
using `np.argsort(shape)[::-1][0]`. 'all_diag' sets all preconditioners to diagonal.
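For 'one_diag', the index computed by `np.argsort(shape)[::-1][0]` is simply the position of the largest dimension, with ties broken toward the last axis. A small helper (hypothetical, for illustration only) makes this easy to check:

```python
import numpy as np

def one_diag_index(shape):
    # Index of the dimension that 'one_diag' makes diagonal:
    # the largest dim, with ties broken toward the last axis.
    return int(np.argsort(shape)[::-1][0])

print(one_diag_index((256, 128)))  # -> 0 (the 256 dim)
print(one_diag_index((128, 128)))  # -> 1 (tie: last axis)
```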

`preconditioner_update_probability`: The preconditioner update probability uses a schedule by 
default that works well for most cases. It anneals from 1.0 to 0.03 over the beginning of training, 
so training will be slightly slower at the start but will speed up by around 4k steps. PSGD 
generally benefits from more preconditioner updates at the start of training, but once the 
preconditioner is learned it's okay to do them less often. An easy way to adjust the update 
frequency is to pass a custom `min_prob` to `precond_update_prob_schedule`.

This is the default schedule from the `precond_update_prob_schedule` function at the top of kron.py:

<img src="assets/default_schedule.png" alt="Default Schedule" width="800" style="max-width: 100%; height: auto;" />


## Resources

PSGD papers and resources, as listed in Xi-Lin's repo:

1) Xi-Lin Li. Preconditioned stochastic gradient descent, [arXiv:1512.04202](https://arxiv.org/abs/1512.04202), 2015. (General ideas of PSGD, preconditioner fitting losses and Kronecker product preconditioners.)
2) Xi-Lin Li. Preconditioner on matrix Lie group for SGD, [arXiv:1809.10232](https://arxiv.org/abs/1809.10232), 2018. (Focus on preconditioners with the affine Lie group.)
3) Xi-Lin Li. Black box Lie group preconditioners for SGD, [arXiv:2211.04422](https://arxiv.org/abs/2211.04422), 2022. (Mainly about the LRA preconditioner. See [these supplementary materials](https://drive.google.com/file/d/1CTNx1q67_py87jn-0OI-vSLcsM1K7VsM/view) for detailed math derivations.)
4) Xi-Lin Li. Stochastic Hessian fittings on Lie groups, [arXiv:2402.11858](https://arxiv.org/abs/2402.11858), 2024. (Some theoretical works on the efficiency of PSGD. The Hessian fitting problem is shown to be strongly convex on set ${\rm GL}(n, \mathbb{R})/R_{\rm polar}$.)
5) Omead Pooladzandi, Xi-Lin Li. Curvature-informed SGD via general purpose Lie-group preconditioners, [arXiv:2402.04553](https://arxiv.org/abs/2402.04553), 2024. (Plenty of benchmark results and analyses for PSGD vs. other optimizers.)


## License

[![CC BY 4.0][cc-by-image]][cc-by]

This work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by].

© 2024 Evan Walters, Omead Pooladzandi, Xi-Lin Li


[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://licensebuttons.net/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg


            
