# Advanced Optimizers (AIO)
A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.
[PyPI](https://pypi.org/project/adv_optm/)
---
## 📦 Installation
```bash
pip install adv_optm
```
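
A minimal usage sketch follows. It assumes the optimizer classes listed later in this README (e.g. `Adam_Adv`) are importable from the top-level `adv_optm` package; check the package source for the exact import path and constructor arguments.

```python
import torch
from adv_optm import Adam_Adv  # assumed top-level export; adjust if the module layout differs

model = torch.nn.Linear(128, 10)
# factored=True enables the rank-1, 1-bit-compressed optimizer states described below
optimizer = Adam_Adv(model.parameters(), lr=1e-4, factored=True)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```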
---
## 🧠 Core Innovations
This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with **1-bit compression for optimizer states**:
### **Memory-Efficient Optimization (SMMF-inspired)**
- **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- **Approach**: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor); see the sketch after this list
- **Innovation**:
- First moment split into **1-bit sign + absolute value**
- Final storage: **four factored vectors + one 1-bit sign state**
- Preserves Adam-like update quality with drastically reduced memory
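
To make the cycle concrete, here is an illustrative rank-1 non-negative factorization with a separate 1-bit sign state (Adafactor-style row/column sums; a sketch only, not the library's exact SMMF code):

```python
import torch

def factor(m: torch.Tensor):
    """Compress a signed moment matrix into two non-negative rank-1 factors
    plus a 1-bit sign tensor (illustrative Adafactor-style factorization)."""
    sign = m >= 0                          # 1-bit sign state
    a = m.abs()
    row, col = a.sum(dim=1), a.sum(dim=0)  # two small vectors instead of a full matrix
    denom = a.sum().clamp(min=1e-30)
    return row, col, denom, sign

def reconstruct(row, col, denom, sign):
    """Rebuild an approximation of the original matrix from its factors."""
    approx = torch.outer(row, col) / denom
    return torch.where(sign, approx, -approx)

# Each optimizer step then follows the cycle: factor -> reconstruct -> update -> factor.
m = torch.randn(4, 3)            # e.g. a first-moment matrix
m_hat = reconstruct(*factor(m))  # approximate moment used for the update
```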
---
## ⚡ Performance Characteristics
### Memory Efficiency (SDXL Model, 6.5 GB)
| Optimizer | Memory Usage | Description |
|-----------|--------------|-------------|
| `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
| `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
### Speed Comparison (SDXL, Batch Size 4)
| Optimizer | Speed | Notes |
|-----------|-------|-------|
| `Adafactor` | ~8.5s/it | Baseline |
| `Adopt_Factored` | ~10s/it | +18% overhead from compression |
| `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |
---
## 🧪 Available Optimizers
### Standard Optimizers (All support `factored=True/False`)
| Optimizer | Description | Best For |
|-----------|-------------|----------|
| `Adam_Adv` | Advanced Adam implementation | General purpose |
| `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
| `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
| `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
---
## ⚙️ Feature Matrix
| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---------|----------|-----------|-------------|---------------------|----------|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✓ | ✗ |
---
## 🛠️ Comprehensive Feature Guide
### A. Universal Safe Features
*These features work with all optimizers and are generally safe to enable.*
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|--------|-------------|-------------------|--------------------|-------------------|--------------|
| **Fused Back Pass** | Applies the optimizer step during the backward pass, so gradients are used immediately and freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| **Stochastic Rounding** | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
| **OrthoGrad** | Removes gradient component parallel to weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
| **Factored** | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
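
As an illustration of the OrthoGrad row above, the projection below removes the gradient component parallel to the weights (a minimal sketch; the reference implementation may additionally rescale the projected gradient to preserve its norm):

```python
import torch

@torch.no_grad()
def orthograd_(param: torch.Tensor) -> None:
    """In-place on param.grad: g <- g - (<w, g> / <w, w>) * w."""
    w, g = param.view(-1), param.grad.view(-1)
    proj = torch.dot(w, g) / torch.dot(w, w).clamp(min=1e-30)
    g.sub_(proj * w)
```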
### B. Individual Features
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|--------|-------------|-------------------|--------------------|-------------------|--------------|
| **Cautious** | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy/Lion |
| **Grams** | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
| **AdEMAMix** | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
| **Simplified_AdEMAMix** | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | [Connections](https://arxiv.org/abs/2502.02431) | Adam/Adopt/Prodigy |
| **atan2** | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (strongly recommended for Adopt, which needs the clipping) | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
| **Kourkoutas-β** | Layer-wise adaptive β₂ based on a gradient “sunspike” ratio | Noisy, small/large-batch, or high-LR training | No overhead | [Kourkoutas-β](https://arxiv.org/abs/2508.12996) | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
> **Note**: If both **Cautious** and **Grams** are enabled, **Grams takes precedence** and Cautious is disabled.
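
For intuition, here is a minimal sketch of the Cautious masking rule in the style of the C-Optim repository (the exact rescaling used by this library may differ):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Keep only update components whose sign agrees with the current gradient,
    rescaled so the average update magnitude is preserved."""
    mask = (update * grad > 0).to(update.dtype)
    scale = mask.numel() / mask.sum().clamp(min=1)
    return update * mask * scale
```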
---
## 🔍 Feature Deep Dives
### AdEMAMix
- Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
- Particularly effective for **small batch sizes**, where Adam's standard first moment is nearly useless.
- **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|-----------|---------|--------------|
| `beta3` | 0.9999 | • Runs >120k steps: **0.9999**<br>• Runs ≤120k steps: **0.999** |
| `alpha` | 5 | • Reduce to **2–3** if diverging<br>• Increase to strengthen long-term memory |
> ✅ **Pro Tip**: Set `beta1=0` in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix's slow EMA, ideal for small-batch regimes.
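
For reference, a minimal sketch of one AdEMAMix-style update following the paper's two-EMA form (the library's factored implementation differs internally):

```python
import torch

def ademamix_step(p, g, state, lr=1e-4, b1=0.9, b2=0.999, b3=0.9999, alpha=5.0, eps=1e-8):
    """One AdEMAMix-style update: fast EMA m1, slow EMA m2 (long memory), second moment v.
    state = {"t": 0, "m1": zeros_like(p), "m2": zeros_like(p), "v": zeros_like(p)}."""
    state["t"] += 1
    t = state["t"]
    m1, m2, v = state["m1"], state["m2"], state["v"]
    m1.mul_(b1).add_(g, alpha=1 - b1)        # fast EMA (drops out entirely if b1 = 0)
    m2.mul_(b3).add_(g, alpha=1 - b3)        # slow EMA retains gradients for ~1/(1-b3) steps
    v.mul_(b2).addcmul_(g, g, value=1 - b2)
    m1_hat = m1 / (1 - b1 ** t) if b1 > 0 else m1
    v_hat = v / (1 - b2 ** t)
    p.sub_(lr * (m1_hat + alpha * m2) / (v_hat.sqrt() + eps))
```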
---
### Simplified_AdEMAMix
- Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- Replaces Adamβs first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
- **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.
#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|----------|---------|--------------|
| `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
| `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
> ⚠️ **Critical**: Requires a **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
> For `Prodigy_Adv`, set `initial_d` to:
> - **LoRA**: `1e-8`
> - **Full FT**: `1e-10`
> - **Embedding**: `1e-7`
> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
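
One way to see the ~100x rule of thumb: if the accumulator updates as `m <- beta1*m + g` (i.e. without Adam's `(1 - beta1)` scaling, which is one plausible reading of the accumulator described above), its steady-state magnitude is roughly `1/(1 - beta1)` times the gradient's, about 100x at the default `beta1 = 0.99`:

```python
# Steady-state magnitude of an accumulator m <- beta1*m + g under a constant unit gradient.
beta1 = 0.99
m = 0.0
for _ in range(2000):
    m = beta1 * m + 1.0
print(round(m))  # ~100 == 1/(1 - beta1), hence the ~100x smaller LR vs AdamW
```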
#### Performance Validation
**Small Batch Training (SDXL, BS=2, 1.8K steps)**
- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
- **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
**Results**:
- Faster convergence and higher final performance with Simplified_AdEMAMix
- D-Adaptation automatically compensates for aggressive updates
- Generated samples show **significantly better quality**
---
### atan2
- Replaces `eps` in Adam-family optimizers with a **scale-invariant**, bounded update rule.
- Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
- **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
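
A minimal sketch of an atan2-style update direction (after the Adam-atan2 repository; the constant `a = 4/pi` below is chosen so the bound works out to roughly [-2, 2], and the library's internal constants may differ):

```python
import math
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                 a: float = 4 / math.pi, b: float = 1.0) -> torch.Tensor:
    """eps-free, scale-invariant update direction, bounded to (-a*pi/2, a*pi/2)."""
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```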
---
### **Kourkoutas-β**
**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
- **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
- **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
#### Pros/Cons
| **Category** | **Details** |
|--------------|-------------|
| ✅ **Pros** | • **Layer-wise adaptation** blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction).<br>• **Robust to sudden loss landscape shifts**: reacts quickly during gradient bursts, smooths during calm phases.<br>• **High tolerance to aggressive learning rates**. |
| ⚠️ **Cons** | • **Potentially unstable at the start of training** due to unreliable early gradient norms; mitigated by using `K-β Warmup Steps`. |
> 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
> 🔍 **Debugging Aid**: Enable `K_Logging` to monitor the (min, max, mean) of dynamic β₂ values across layers every *N* steps.
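
A rough sketch of the idea follows. The sunspike definition and the default bounds here are assumptions for illustration only; see the kbeta repository for the actual per-layer tracker:

```python
def kourkoutas_beta2(grad_norm: float, running_max: float,
                     beta2_sel: float = 0.999, beta2_lo: float = 0.88,
                     tiny: float = 1e-12) -> float:
    """Layer-wise beta2: interpolate between the selected beta2 (calm phases) and a
    lower bound (gradient bursts), driven by a bounded 'sunspike' ratio."""
    sunspike = min(grad_norm / (running_max + tiny), 1.0)  # small when calm, ~1 in a burst
    return beta2_sel - sunspike * (beta2_sel - beta2_lo)
```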
#### 📊 Performance Validation
**ADAMW_ADV, full SDXL fine-tuning (aggressive LR 3e-5, BS=4, 2.5K steps)**
<img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
- 🟣 Fixed `beta2=0.999`
- 🟠 Auto K-beta
**Observations:**
- The K-beta run is clearly more robust and stable at high learning rates.
> 📚 **Reference**:
> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
> - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
---
## Recommended Preset (Tested on LoRA/FT/Embedding)
```yaml
Learning Rate: 1
optimizer: PRODIGY_Adv
settings:
  - beta1: 0.99             # Controls momentum decay; ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
  - beta2: 0.999
  - kourkoutas_beta: True   # For Kourkoutas-β
  - K-β Warmup Steps: 50    # Or 100, 200, depending on your run
  - Simplified_AdEMAMix: True
  - Grad α: 100
  - OrthoGrad: True
  - weight_decay: 0.0
  - initial_d:
      LoRA: 1e-8
      Full fine-tune: 1e-10
      Embedding: 1e-7
  - d_coef: 1
  - d_limiter: True         # Stabilizes Prodigy with Simplified_AdEMAMix
  - factored: False         # Can be True or False; quality should not degrade thanks to Simplified_AdEMAMix's high tolerance to 1-bit factorization.
```
> ✅ **Why it works**:
> - `Kourkoutas-β` adapts `beta2` per layer, so no manual β₂ tuning is needed
> - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
> - `OrthoGrad` prevents overfitting without weight decay
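
For orientation, the same preset expressed as a hypothetical constructor call; the keyword names simply mirror the YAML labels above and may not match the real `adv_optm` signatures, so treat this as a sketch and check the package source:

```python
from adv_optm import Prodigy_Adv  # assumed top-level export

# Hypothetical kwargs mirroring the YAML preset (names are illustrative, not the actual API).
optimizer = Prodigy_Adv(
    model.parameters(),
    lr=1,                      # Prodigy adapts the effective step size via d
    betas=(0.99, 0.999),
    kourkoutas_beta=True,
    k_warmup_steps=50,
    simplified_ademamix=True,
    grad_alpha=100,
    orthograd=True,
    weight_decay=0.0,
    initial_d=1e-8,            # 1e-10 for full fine-tunes, 1e-7 for embeddings
    d_coef=1,
    d_limiter=True,
    factored=False,
)
```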
---
## 📚 References
1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)