# Advanced Optimizers
This repo provides a new family of highly efficient, lightweight yet powerful optimizers, grounded in recent academic literature and validated through practical training runs across diverse models.
---
### Install
`pip install adv_optm`
---
### Theory (Inspired by SMMF)
Based primarily on:
**[SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization](https://arxiv.org/abs/2412.08894)**
The core innovation:
- Uses fast, rank-1 non-negative matrix factorization (NNMF), but **reconstructs the full state before each update** to preserve momentum accuracy, then re-factors afterward (a factor → reconstruct → update → factor cycle, sketched in code below).
- For the *signed first moment*, we split into **sign + absolute value**:
- Sign is stored as **1-bit state** via bitwise ops (SMMF originally used 8-bit with 7 bits wasted).
- Absolute value goes through the factor/reconstruct cycle using two factored vectors + the signed state.
- Final storage: **four factored vectors + one 1-bit sign**.
- Updates behave like full-state Adam but with drastically reduced memory.
> ✅ **TL;DR**: Lightweight, strong, memory-efficient optimizer.
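
The cycle is easiest to see in code. Below is a minimal, self-contained PyTorch sketch of the first-moment update for a single 2-D parameter. The helper names (`nnmf_rank1`, `first_moment_step`) are illustrative only, not the library's actual internals, and the sign map is kept as a plain bool tensor here rather than truly packed bits.

```python
import torch

def nnmf_rank1(x: torch.Tensor):
    """Rank-1 NNMF of a non-negative (m, n) matrix: torch.outer(col, row) ≈ x."""
    col = x.sum(dim=1)                            # (m,) row sums
    row = x.sum(dim=0)                            # (n,) column sums
    return col / x.sum().clamp_min(1e-30), row

def first_moment_step(col, row, sign, grad, beta1=0.9):
    """One factor → reconstruct → update → factor cycle for the signed first moment."""
    # 1) reconstruct |m| from the two factored vectors, then restore the sign
    m_abs = torch.outer(col, row)
    m = torch.where(sign, m_abs, -m_abs)
    # 2) ordinary EMA update, exactly as a full-state optimizer would perform it
    m = beta1 * m + (1 - beta1) * grad
    # 3) re-factor: 1-bit sign map + rank-1 factorization of the magnitude
    sign = m >= 0                                 # packed into real 1-bit storage in practice
    col, row = nnmf_rank1(m.abs())
    return col, row, sign
```

The second moment is already non-negative, so it only needs its own two factored vectors, which gives the four vectors + one 1-bit sign map listed above.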
---
### Memory Cost
- **Adopt_Factored** for full SDXL finetune: **328 MB** (4 small vectors + 1-bit state)
- **Adopt_Factored with AdEMAMix** for full SDXL finetune: **625 MB** (6 small vectors + two 1-bit states)
> For reference, SDXL is a ~6.5 GB model.
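
As a rough sanity check on these numbers, here is the per-matrix arithmetic behind the factored state. This is a back-of-the-envelope sketch, not an exact accounting of every SDXL module:

```python
def adam_state_bytes(m: int, n: int, dtype_bytes: int = 4) -> float:
    """Full Adam/Adopt: two dense moment tensors with the same shape as the weight."""
    return 2 * m * n * dtype_bytes

def factored_state_bytes(m: int, n: int, dtype_bytes: int = 4) -> float:
    """Factored variant: four vectors (two per moment) + a packed 1-bit sign map."""
    vectors = 2 * (m + n) * dtype_bytes
    sign_map = m * n / 8                 # one bit per element
    return vectors + sign_map

# example: a single hypothetical 1280x1280 projection weight
print(adam_state_bytes(1280, 1280) / 2**20)      # ≈ 12.5 MiB
print(factored_state_bytes(1280, 1280) / 2**20)  # ≈ 0.21 MiB
```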
---
### ⏱️ Speed (my tests on SDXL, batch size 4)
- **Adopt_Factored**: ~10s/it
- **Adopt_Factored with AdEMAMix**: ~12s/it
- **Adafactor**: ~8.5s/it
→ The overhead comes from the compression/reconstruction cycles.
→ It's still faster than [MLorc](https://arxiv.org/abs/2506.01897) (~12s/it), which uses RSVD compression, and should be the fastest momentum-compression method available (AFAIK).
---
### 📈 Performance
- **Better than Adafactor and CAME** (other factorization methods)
- **Comparable or identical to Adam** (see SMMF paper results)
---
### Available Optimizers (all support `Factored` toggle)
Set `Factored=False` to disable factorization and run as a full uncompressed optimizer (like vanilla Adam).
1. **Adam**
2. **Prodigy**
3. **Adopt**
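
A minimal usage sketch is shown below. The import path and keyword name follow the package/README naming but should be treated as assumptions; check the repository for the exact constructor signatures.

```python
import torch
from adv_optm import Adopt  # assumed class name; see the repo for the exact API

model = torch.nn.Linear(512, 512)

# Factored=True (compressed states); set Factored=False for the full, Adam-like optimizer
opt = Adopt(model.parameters(), lr=1e-4, Factored=True)

for _ in range(10):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```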
---
### Bonus Features (Built-in)
- **Fused Backward Pass**
- **Stochastic Rounding (SR)**: Improves quality and convergence for **BF16 training**.
- **[AdEMAMix](https://arxiv.org/abs/2409.03137)**
→ Adds a second, slow-moving EMA that is combined with the primary momentum to stabilize updates, especially during long full-finetuning runs (a schematic of the combined update appears after this list).
→ A higher value of beta3 (e.g., 0.9999) gives the EMA a longer memory, making it more stable but slower to adapt. A lower value (e.g., 0.999) is often better for shorter training runs (2k-4k steps).
→ When `factored` is true, it compresses the new momentum in the same way as the first moment (1-bit state + 2 vectors). However, this introduces noticeable overhead as we are compressing/reconstructing a third state each step.
⚠️ **Note**: AdEMAMix updates are more aggressive than plain Adam/Adopt, so use a 2–5× smaller LR than usual (or use Prodigy).
- **[`atan2` smoothing & scaling](https://github.com/lucidrains/adam-atan2-pytorch)**
→ Robust `eps` replacement (no tuning!) + built-in gradient clipping (see the sketch after this list)
→ *Ideal for ADOPT* (which normally needs a higher `eps` and clipping), so `use_atan2` is an all-in-one fix for it.
- **[OrthoGrad](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)**
→ Removes the gradient component parallel to the weights (sketched after this list) → prevents "naïve loss minimization" (NLM) → reduces natural overfitting
→ Well suited to fine-tuning the direction of existing features (e.g., a full finetune, or continuing to train an already-trained LoRA) without weight decay erasing prior knowledge.
⚠️ **Note**: OrthoGrad introduces **~33% time overhead**, so take this into account.
- **[Grams: Gradient Descent with Adaptive Momentum Scaling](https://github.com/Gunale0926/Grams)**
→ Eliminates the need for 1-bit momentum sign storage by using the **sign of gradients** for the first moment.
⚠️ **Not recommended for small batch sizes**: gradients are too noisy, which can destabilize the momentum (tested with Prodigy at BS 4; it made the optimizer slower to find the LR and to converge).
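
For reference, the AdEMAMix combination described above looks roughly like this. It is a schematic of the update from the paper (uncompressed, single tensor), not this library's fused/factored implementation; `alpha` follows the paper's notation.

```python
import torch

def ademamix_step(p, grad, m1, m2, v, step, lr,
                  beta1=0.9, beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
    """Schematic AdEMAMix update: a fast EMA (m1) and a slow EMA (m2) share one second moment (v)."""
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)          # fast EMA, as in Adam
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)          # slow EMA with a long memory
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    m1_hat = m1 / (1 - beta1 ** step)                   # bias correction (m2 is left uncorrected)
    v_hat = v / (1 - beta2 ** step)

    p.add_((m1_hat + alpha * m2) / (v_hat.sqrt() + eps), alpha=-lr)
```

The paper also schedules `alpha` and `beta3` with a warmup over training; that scheduling is omitted here for brevity.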
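The `use_atan2` path replaces the usual division by `sqrt(v) + eps` with a bounded arctangent, which is what makes `eps` and external clipping unnecessary. A minimal sketch of the rule is below; the constants follow the adam-atan2 reference implementation and are assumptions here.

```python
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    """Bounded replacement for m_hat / (v_hat.sqrt() + eps).
    The result lies in (-a*pi/2, a*pi/2), so the step is implicitly clipped and no
    eps is needed: atan2 is well-defined even when both arguments are zero."""
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```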
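OrthoGrad itself is a simple per-parameter projection. A minimal sketch of the idea (ignoring the norm-rescaling detail of the reference implementation):

```python
import torch

@torch.no_grad()
def orthogonalize_grad(p: torch.Tensor) -> None:
    """Drop the component of p.grad that is parallel to the weights p, so the update
    cannot simply rescale existing weights (the 'naïve loss minimization' direction)."""
    if p.grad is None:
        return
    w, g = p.view(-1), p.grad.view(-1)
    proj = torch.dot(w, g) / torch.dot(w, w).clamp_min(1e-30)
    g.sub_(proj * w)       # g ← g − ((w·g) / (w·w)) w, applied in place on p.grad
```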
### Other Notes
- **Adopt** skips the first step (it only initializes the states) and has built-in clipping (following the original optimizer), but both are skipped when you enable `use_atan2`, since the update becomes scale-invariant and the state values can no longer cause instability.
- When `use_atan2` is True, `eps` is ignored, and you should also disable any external gradient clipping.
---