| Field | Value |
| --- | --- |
| Name | kron-torch |
| Version | 0.2.1 |
| Summary | An implementation of PSGD Kron optimizer in PyTorch. |
| Repository | https://github.com/evanatyourservice/kron_torch |
| Upload time | 2024-10-07 17:13:51 |
| Author | Evan Walters, Omead Pooladzandi, Xi-Lin Li |
| Requires Python | >=3.9 |
| Keywords | python, machine learning, optimization, pytorch |


# PSGD Kron

For the original PSGD repo, see [psgd_torch](https://github.com/lixilinx/psgd_torch).

For the JAX version, see [psgd_jax](https://github.com/evanatyourservice/psgd_jax).

Implementation of the [PSGD Kron optimizer](https://github.com/lixilinx/psgd_torch) in PyTorch. PSGD is a second-order optimizer originally created by Xi-Lin Li that uses either a Hessian-based or whitening-based (gg^T) preconditioner and Lie groups to improve training convergence, generalization, and efficiency. I highly suggest taking a look at the readme of Xi-Lin's PSGD repo linked above for interesting details on how PSGD works and experiments using PSGD.

### `kron`:

The most versatile and easy-to-use PSGD optimizer is `kron`, which uses a Kronecker-factored preconditioner. It has fewer hyperparameters that need tuning than Adam, and can generally act as a drop-in replacement for Adam.

## Installation

```bash
pip install kron-torch
```

## Basic Usage (Kron)

By default, Kron schedules the preconditioner update probability to start at 1.0 and anneal to 0.03 over the beginning of training, so training will be slightly slower at the start but will speed up to near Adam's speed by around 3k steps.

For basic usage, use the `kron` optimizer like any other PyTorch optimizer:

```python
from kron_torch import Kron

optimizer = Kron(params)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

**Basic hyperparameters:**

TLDR: Learning rate and weight decay act similarly to Adam's; start with Adam-like settings and go from there. There is no b2 or epsilon.

The next settings control whether a dimension's preconditioner is diagonal or triangular. For example, for a layer with shape (256, 128), triangular preconditioners would have shapes (256, 256) and (128, 128), and diagonal preconditioners would have shapes (256,) and (128,). Depending on how these settings are chosen, `kron` can balance between memory/speed and effectiveness (see below).
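To make the (256, 128) example concrete, here is a small illustrative sketch of how the settings documented below (`max_size_triangular`, `max_skew_triangular`, `min_ndim_triangular`) could pick a per-dimension preconditioner shape. This is hypothetical pseudologic written from the descriptions in this readme, not kron.py's actual code:

```python
import math

def precond_shapes(shape, max_size_triangular=8192,
                   max_skew_triangular=math.inf,
                   min_ndim_triangular=2):
    """Illustrative sketch (not kron.py's implementation) of choosing a
    diagonal vs. triangular preconditioner shape for each dimension."""
    shapes = []
    for size in shape:
        # Tensors with too few dims get all-diagonal preconditioners.
        too_few_dims = len(shape) < min_ndim_triangular
        # Dims larger than the size cap fall back to diagonal.
        too_large = size > max_size_triangular
        # If the tensor is very skewed, the larger dim goes diagonal.
        too_skewed = (size == max(shape)
                      and size / min(shape) > max_skew_triangular)
        if too_few_dims or too_large or too_skewed:
            shapes.append((size,))        # diagonal preconditioner
        else:
            shapes.append((size, size))   # triangular preconditioner
    return shapes
```

Under these assumed rules, `precond_shapes((256, 128))` yields `[(256, 256), (128, 128)]`, and `precond_shapes((50000, 768), max_skew_triangular=10)` yields `[(50000,), (768, 768)]`, matching the (diag, tri) embedding example below.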
`max_size_triangular`: Any dimension with size above this value will have a diagonal preconditioner; anything at or below will have a triangular preconditioner. So if you have a dim with size 16,384 that you want to use a diagonal preconditioner for, set `max_size_triangular` to something like 15,000. Default is 8192.

`max_skew_triangular`: Any tensor with skew above this value will make the larger dim diagonal. For example, if `max_skew_triangular` = 10, a bias layer of shape (256,) would be diagonal because 256/1 > 10, and an embedding layer with shape (50000, 768) would be (diag, tri) because 50000/768 is greater than 10. The default value is `inf`.

`min_ndim_triangular`: Any tensor with fewer than this number of dims will have all diagonal preconditioners. Default is 2, so single-dim tensors like bias and scale will use diagonal preconditioners.

Interesting setups using these settings:

- Setting `max_size_triangular` to 0 will make all layers have diagonal preconditioners, which uses very little memory and runs the fastest, but the optimizer might be less effective.

- With `max_skew_triangular` set to 1, if a layer has one dim larger than the rest, that dim will use a diagonal preconditioner. This setup usually results in less memory usage than Adam, and is more performant than having all diagonal preconditioners.

`preconditioner_update_probability`: The preconditioner update probability uses a schedule by default that works well for most cases. It anneals from 1 to 0.03 at the beginning of training, so training will be slightly slower at the start but will speed up to near Adam's speed by around 3k steps. PSGD generally benefits from more preconditioner updates at the start of training, but once the preconditioner is learned it's okay to do them less often.
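The anneal-from-1-to-0.03 behavior described above can be sketched as a simple function. The flat-start length and decay constant here are assumptions for illustration; the actual default lives in kron.py's `precond_update_prob_schedule`:

```python
import math

def update_prob_schedule(step, max_prob=1.0, min_prob=0.03,
                         decay=0.001, flat_start=500):
    """Hold the update probability at max_prob for flat_start steps, then
    decay exponentially toward min_prob (constants are illustrative, not
    necessarily kron.py's defaults)."""
    prob = max_prob * math.exp(-decay * max(step - flat_start, 0))
    return max(prob, min_prob)
```

With these illustrative constants, the probability stays at 1.0 for the first 500 steps, then decays smoothly and bottoms out at 0.03, which is roughly the shape of the default schedule plotted below. If `Kron` accepts a callable for its `preconditioner_update_probability` argument (check kron.py), a custom schedule like this could be passed in place of the default.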
This is the default schedule in the `precond_update_prob_schedule` function at the top of kron.py:

<img src="assets/default_schedule.png" alt="Default Schedule" width="800" style="max-width: 100%; height: auto;" />

See kron.py for more hyperparameter details.

## Resources

PSGD papers and resources listed from Xi-Lin's repo:

1) Xi-Lin Li. Preconditioned stochastic gradient descent, [arXiv:1512.04202](https://arxiv.org/abs/1512.04202), 2015. (General ideas of PSGD, preconditioner fitting losses, and Kronecker product preconditioners.)
2) Xi-Lin Li. Preconditioner on matrix Lie group for SGD, [arXiv:1809.10232](https://arxiv.org/abs/1809.10232), 2018. (Focus on preconditioners with the affine Lie group.)
3) Xi-Lin Li. Black box Lie group preconditioners for SGD, [arXiv:2211.04422](https://arxiv.org/abs/2211.04422), 2022. (Mainly about the LRA preconditioner. See [these supplementary materials](https://drive.google.com/file/d/1CTNx1q67_py87jn-0OI-vSLcsM1K7VsM/view) for detailed math derivations.)
4) Xi-Lin Li. Stochastic Hessian fittings on Lie groups, [arXiv:2402.11858](https://arxiv.org/abs/2402.11858), 2024. (Some theoretical work on the efficiency of PSGD. The Hessian fitting problem is shown to be strongly convex on the set ${\rm GL}(n, \mathbb{R})/R_{\rm polar}$.)
5) Omead Pooladzandi, Xi-Lin Li. Curvature-informed SGD via general purpose Lie-group preconditioners, [arXiv:2402.04553](https://arxiv.org/abs/2402.04553), 2024. (Plenty of benchmark results and analyses for PSGD vs. other optimizers.)

## License

[![CC BY 4.0][cc-by-image]][cc-by]

This work is licensed under a [Creative Commons Attribution 4.0 International License][cc-by].

2024 Evan Walters, Omead Pooladzandi, Xi-Lin Li

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://licensebuttons.net/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg

