adv-optm

Version: 0.1.2
Home page: https://github.com/Koratahiu/Advanced_Optimizers
Summary: A family of highly efficient, lightweight yet powerful optimizers.
Upload time: 2025-09-15 03:04:45
Author: Koratahiu
Requires Python: >=3.8
License: Apache 2.0
Keywords: llm, fine-tuning, memory-efficient, low-rank, compression, pytorch, optimizer, adam
# Advanced Optimizers

This repo introduces a new family of highly efficient, lightweight yet powerful optimizers, born from extensive research into recent academic literature and validated through practical training runs across diverse models.

---

### Install

`pip install adv_optm`

---

### Theory (Inspired by SMMF)

Based primarily on:  
**[SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization](https://arxiv.org/abs/2412.08894)**

The core innovation:
- Uses fast, rank-1 non-negative matrix factorization (NNMF), but **reconstructs the full state before each update** to preserve momentum accuracy, then re-factors afterward (a factor → reconstruct → update → factor cycle).
- The *signed first moment* is split into **sign + absolute value**:
  - The sign is stored as a **1-bit state** via bitwise ops (SMMF originally used an 8-bit state, wasting 7 bits).
  - The absolute value goes through the factor/reconstruct cycle using two factored vectors, combined with the stored sign.
- Final storage: **four factored vectors + one 1-bit sign**.
- Updates behave like full-state Adam but with drastically reduced memory.

> ✅ **TL;DR**: Lightweight, strong, memory-efficient optimizer.
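
Below is a minimal, illustrative sketch of this factor → reconstruct → update → factor cycle for the signed first moment, using an Adafactor-style rank-1 non-negative factorization in plain PyTorch. The exact factorization, bit-packing, and tensor names used by this package may differ.

```python
import torch

def nnmf_rank1(M):
    """Rank-1 non-negative factorization from row/column sums (Adafactor-style)."""
    row = M.sum(dim=1)                    # (m,)
    col = M.sum(dim=0)                    # (n,)
    total = M.sum().clamp_min(1e-30)      # normalizer, guarded against zero
    return row, col, total

def reconstruct(row, col, total):
    """Approximate the full non-negative matrix from its factored vectors."""
    return torch.outer(row, col) / total

# --- one illustrative step for the signed first moment m ---
grad, beta1 = torch.randn(128, 64), 0.9

# stored state: sign (conceptually packed to 1 bit/param) + two factored vectors
sign = torch.ones(128, 64, dtype=torch.bool)          # stands in for the 1-bit state
row, col, total = nnmf_rank1(torch.zeros(128, 64))    # factored |m|, zero at init

m = (sign.float() * 2 - 1) * reconstruct(row, col, total)   # reconstruct signed m
m = beta1 * m + (1 - beta1) * grad                          # standard EMA update
sign = m >= 0                                               # new 1-bit sign state
row, col, total = nnmf_rank1(m.abs())                       # re-factor the magnitude
```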

---

### Memory Cost

- **Adopt_Factored** for full SDXL finetune: **328 MB** (4 small vectors + 1-bit state)
- **Adopt_Factored with AdEMAMix** for full SDXL finetune: **625 MB** (6 small vectors + two 1-bit states)
> For reference, SDXL is a 6.5 GB model.

---

### ⏱️ Speed (my tests on SDXL, BS 4)

- **Adopt_Factored**: ~10s/it
- **Adopt_Factored with AdEMAMix**: ~12s/it
- **Adafactor**: ~8.5s/it  
→ The extra time over Adafactor comes from the compression/reconstruction cycles.
→ Still faster than [MLorc](https://arxiv.org/abs/2506.01897) (~12 s/it), which uses RSVD compression; this should be the fastest momentum-compression method available (AFAIK).

---

### 📈 Performance

- **Better than the Adafactor and CAME factorization methods**
- **Comparable or identical to Adam** (see SMMF paper results)

---

### Available Optimizers (all support `Factored` toggle)

Set `Factored=False` to disable factorization and run as a full, uncompressed optimizer (like vanilla Adam); a usage sketch follows the list below.

1. **Adam**
2. **Prodigy**
3. **Adopt**
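
A hypothetical usage sketch. The class and argument names below (`Adopt`, `factored`, `use_atan2`) are assumptions based on this README; check the repository for the exact API.

```python
import torch
from adv_optm import Adopt  # assumed import path / class name

model = torch.nn.Linear(512, 512)
optimizer = Adopt(
    model.parameters(),
    lr=1e-4,
    factored=True,    # False = run as a full, uncompressed optimizer
    use_atan2=True,   # robust eps replacement + built-in clipping (see below)
)

for _ in range(10):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```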

---

### Bonus Features (Built-in)

- **Fused Backward Pass**

- **Stochastic Rounding (SR)**: Improves quality and convergence for **BF16 training**.
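
  A minimal sketch of the usual BF16 stochastic-rounding trick (the package's internal implementation may differ): random bits are added below the BF16 cutoff of the FP32 bit pattern before truncation, so values round up with probability proportional to the discarded fraction.

  ```python
  import torch

  def bf16_stochastic_round(x: torch.Tensor) -> torch.Tensor:
      """Round an FP32 tensor to BF16 stochastically instead of round-to-nearest."""
      bits = x.float().contiguous().view(torch.int32)     # reinterpret FP32 bit pattern
      noise = torch.randint(0, 1 << 16, x.shape,
                            dtype=torch.int32, device=x.device)
      rounded = (bits + noise) & -65536                   # add noise, zero low 16 bits
      return rounded.view(torch.float32).to(torch.bfloat16)
  ```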

- **[AdEMAMix](https://arxiv.org/abs/2409.03137)**  
  → This adds a second, slow-moving EMA, which is combined with the primary momentum to stabilize updates, especially during long runs of full finetuning.
  → A higher value of beta3 (e.g., 0.9999) gives the EMA a longer memory, making it more stable but slower to adapt. A lower value (e.g., 0.999) is often better for shorter training runs (2k-4k steps).
  → When `Factored` is enabled, the slow EMA is compressed the same way as the first moment (1-bit sign state + 2 factored vectors). This introduces noticeable overhead, since a third state is compressed/reconstructed each step.

  ⚠️ **Note**: AdEMAMix updates are more aggressive than normal Adam/Adopt, so use a 2x-5x smaller LR than usual (or use Prodigy).
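
  An illustrative sketch of an AdEMAMix-style step with two EMAs (hyperparameter defaults here are illustrative, not the package's):

  ```python
  import torch

  def ademamix_step(p, grad, m1, m2, v, step, lr=1e-4,
                    beta1=0.9, beta2=0.999, beta3=0.9999, alpha=5.0, eps=1e-8):
      """Fast EMA m1 + slow EMA m2 + second moment v, combined into one update."""
      m1.mul_(beta1).add_(grad, alpha=1 - beta1)            # fast EMA
      m2.mul_(beta3).add_(grad, alpha=1 - beta3)            # slow EMA (long memory)
      v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second moment

      m1_hat = m1 / (1 - beta1 ** step)                     # bias correction
      v_hat = v / (1 - beta2 ** step)

      update = (m1_hat + alpha * m2) / (v_hat.sqrt() + eps)
      p.add_(update, alpha=-lr)
  ```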

- **[`atan2` smoothing & scaling](https://github.com/lucidrains/adam-atan2-pytorch)**  
  → Robust `eps` replacement (no tuning!) + built-in gradient clipping  
  → *Ideal for ADOPT* (which normally needs a higher `eps` and gradient clipping), so `use_atan2` is an all-in-one replacement for both.
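
  The core substitution, sketched (constants are illustrative; see the linked repo for the exact values used):

  ```python
  import torch

  m_hat = torch.randn(10)    # bias-corrected first moment (example values)
  v_hat = torch.rand(10)     # bias-corrected second moment (example values)
  eps = 1e-8

  adam_dir  = m_hat / (v_hat.sqrt() + eps)              # conventional Adam direction
  a, b = 1.27, 1.0                                      # scaling constants (illustrative)
  atan2_dir = a * torch.atan2(m_hat, b * v_hat.sqrt())  # bounded, no eps, implicit clipping
  ```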

- **[OrthoGrad](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)**  
  → Removes the gradient component parallel to the weights → prevents "naïve loss minimization" (NLM) → reduces natural overfitting  
  → Well suited to fine-tuning the direction of existing features (e.g., a full finetune, or continuing training of an already-trained LoRA) without weight decay erasing prior knowledge.

  ⚠️ **Note**: OrthoGrad introduces **~33% time overhead**, so take this into account.
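
  A sketch of the projection (following the linked OrthoGrad idea; the norm-rescaling detail may differ in this package):

  ```python
  import torch

  def orthograd_(param: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30):
      """In place, remove the gradient component parallel to the weights."""
      w, g = param.flatten(), grad.flatten()
      proj = torch.dot(w, g) / (torch.dot(w, w) + eps)   # projection coefficient
      g_orth = g - proj * w                              # keep only the orthogonal part
      g_orth *= g.norm() / (g_orth.norm() + eps)         # restore the original norm
      grad.copy_(g_orth.view_as(grad))
  ```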

- **[Grams: Gradient Descent with Adaptive Momentum Scaling](https://github.com/Gunale0926/Grams)**  
  → Eliminates the need for 1-bit momentum sign storage by using the **sign of gradients** for the first moment.

  ⚠️ **Not recommended for small batch sizes**: gradients are too noisy, which can destabilize momentum (tested with Prodigy at BS 4: it made the optimizer slower to find the LR and to converge).
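
  A simplified sketch of the Grams idea: keep the Adam-style magnitude, but take the sign from the current gradient instead of from the momentum, so no momentum sign needs to be stored.

  ```python
  import torch

  m_hat = torch.randn(10)    # bias-corrected first moment (example values)
  v_hat = torch.rand(10)     # bias-corrected second moment (example values)
  grad  = torch.randn(10)
  eps   = 1e-8

  adam_update  = m_hat / (v_hat.sqrt() + eps)
  grams_update = grad.sign() * adam_update.abs()   # magnitude from Adam, sign from grad
  ```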

### Other Notes

- **Adopt** skips the first step (it only initializes the states) and uses built-in clipping, following the original optimizer. Both are skipped when `use_atan2` is enabled, since the update becomes scale-invariant and the state values no longer cause issues or instability.

- When `use_atan2` is True, `eps` is ignored, and you should also disable any external gradient clipping.
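
For illustration, the difference described in the notes above could look like this inside an ADOPT-style step (a simplified sketch, not the package's exact code; the clipping schedule is illustrative):

```python
import torch

def adopt_direction(grad, v_prev, step, use_atan2=False, eps=1e-6):
    """ADOPT normalizes the gradient by the *previous* second moment."""
    if use_atan2:
        # bounded and scale-invariant: no eps and no explicit clipping needed
        return torch.atan2(grad, v_prev.sqrt())
    clip = step ** 0.25                                  # clipping bound grows with step
    return (grad / v_prev.sqrt().clamp_min(eps)).clamp_(-clip, clip)
```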

---

            
