adv-optm

- **Name**: adv-optm
- **Version**: 1.1.3
- **Home page**: https://github.com/Koratahiu/Advanced_Optimizers
- **Summary**: A family of highly efficient, lightweight yet powerful optimizers.
- **Upload time**: 2025-10-16 08:49:31
- **Author**: Koratahiu
- **Requires Python**: >=3.8
- **License**: Apache 2.0
- **Keywords**: llm, fine-tuning, memory-efficient, low-rank, compression, pytorch, optimizer, adam
# Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.

[![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)

---

## 📦 Installation

```bash
pip install adv_optm
```
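
A minimal usage sketch of the familiar `torch.optim` workflow (the class names come from the tables below; the exact constructor keywords are assumptions, so check the library's docstrings):

```python
import torch
from adv_optm import Prodigy_Adv  # or Adam_Adv, Adopt_Adv, Lion_Adv, ...

model = torch.nn.Linear(128, 10)

# Keyword arguments here are illustrative assumptions; verify names against the library.
optimizer = Prodigy_Adv(model.parameters(), lr=1.0, factored=True)

for _ in range(10):
    loss = model(torch.randn(4, 128)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```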

---

## 🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with **1-bit compression for optimizer states**:

### **Memory-Efficient Optimization (SMMF-inspired)**
- **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- **Approach**: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor), sketched below
- **Innovation**: 
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
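
A conceptual sketch of the factor → reconstruct → update → factor cycle, using the simple row-sum/column-sum rank-1 non-negative factorization as a stand-in (the library's actual factorization and bookkeeping may differ):

```python
import torch

def factor_rank1(x: torch.Tensor):
    """Compress a non-negative matrix into two vectors plus a scalar normalizer."""
    row = x.sum(dim=1, keepdim=True)        # (rows, 1)
    col = x.sum(dim=0, keepdim=True)        # (1, cols)
    total = x.sum().clamp_min(1e-30)
    return row, col, total

def reconstruct_rank1(row, col, total):
    """Rank-1 approximation: x ~= row @ col / total."""
    return (row @ col) / total

# Toy cycle for one 2-D parameter; only the factored vectors (and a 1-bit sign
# state for the first moment) would be kept between steps.
p, grad = torch.randn(64, 32), torch.randn(64, 32)

# Second moment (non-negative): factor -> reconstruct -> update -> factor.
v_row, v_col, v_tot = factor_rank1(torch.full_like(p, 1e-8))
v = reconstruct_rank1(v_row, v_col, v_tot)
v = 0.999 * v + 0.001 * grad.pow(2)
v_row, v_col, v_tot = factor_rank1(v)

# First moment: split into a 1-bit sign state plus a factored absolute value.
m = 0.1 * grad                              # first EMA step from a zero state
m_sign = m >= 0                             # stored as 1 bit per element
m_row, m_col, m_tot = factor_rank1(m.abs())
```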

---

## ⚡ Performance Characteristics

### Memory Efficiency (SDXL Model – 6.5GB)
| Optimizer | Memory Usage | Description |
|-----------|--------------|-------------|
| `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
| `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |

### Speed Comparison (SDXL, Batch Size 4)
| Optimizer | Speed | Notes |
|-----------|-------|-------|
| `Adafactor` | ~8.5s/it | Baseline |
| `Adopt_Factored` | ~10s/it | +18% overhead from compression |
| `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |

---

## 🧪 Available Optimizers

### Standard Optimizers (All support `factored=True/False`)
| Optimizer | Description | Best For |
|-----------|-------------|----------|
| `Adam_Adv` | Advanced Adam implementation | General purpose |
| `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
| `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
| `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |

---

## βš™οΈ Feature Matrix

| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---------|----------|-----------|-------------|---------------------|----------|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✓ | ✗ |

---

## πŸ› οΈ Comprehensive Feature Guide

### A. Universal Safe Features  
*These features work with all optimizers and are generally safe to enable.*

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|--------|-------------|-------------------|--------------------|-------------------|--------------|
| **Fused Back Pass** | Fuses the optimizer step into the backward pass, so each gradient is applied immediately and its memory freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| **Stochastic Rounding** | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
| **OrthoGrad** | Removes gradient component parallel to weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
| **Factored** | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
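
A minimal sketch of the OrthoGrad projection from the table above (illustrative only, not the library's implementation): the gradient component parallel to the weight vector is subtracted before the optimizer step.

```python
import torch

def orthograd(param: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Remove the component of grad that is parallel to param."""
    w, g = param.flatten(), grad.flatten()
    coef = torch.dot(w, g) / (torch.dot(w, w) + eps)   # scalar projection coefficient
    g_orth = g - coef * w                              # orthogonal residual
    # The reference method may additionally rescale g_orth back to ||g||; omitted here.
    return g_orth.view_as(grad)

p = torch.randn(32, 16)
g = torch.randn_like(p)
g_orth = orthograd(p, g)
print(torch.dot(p.flatten(), g_orth.flatten()))        # ~0: parallel part removed
```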

### B. Individual Features

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|--------|-------------|-------------------|--------------------|-------------------|--------------|
| **Cautious** | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy/Lion |
| **Grams** | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
| **AdEMAMix** | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
| **Simplified_AdEMAMix** | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | [Connections](https://arxiv.org/abs/2502.02431) | Adam/Adopt/Prodigy |
| **atan2** | Robust epsilon replacement with built-in gradient clipping | Stable, bounded updates (especially for Adopt, which needs the clipping) | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
| **Kourkoutas-β** | Layer-wise adaptive β₂ based on the gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | [Kourkoutas-β](https://arxiv.org/abs/2508.12996) | Adam/Adopt/Prodigy/Simplified_AdEMAMix |

> **Note**: If both **Cautious** and **Grams** are enabled, **Grams takes precedence** and Cautious is disabled.
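
A minimal sketch of the Cautious masking from the table above, assuming the mask-and-rescale form used in C-Optim (the library's exact implementation may differ):

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero out update entries whose sign disagrees with the gradient, then rescale."""
    mask = (update * grad > 0).to(update.dtype)
    mask = mask * (mask.numel() / mask.sum().clamp_min(1.0))   # keep average magnitude
    return update * mask

update, grad = torch.randn(8), torch.randn(8)
print(cautious(update, grad))   # disagreeing entries are zeroed
```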

---

## πŸ” Feature Deep Dives

### AdEMAMix

- Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
- Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
- **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)

#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|-----------|---------|--------------|
| `beta3` | 0.9999 | • Runs >120k steps: **0.9999**<br>• Runs ≤120k steps: **0.999** |
| `alpha` | 5 | • Reduce to **2–3** if diverging<br>• Increase to strengthen long-term memory |

> ✅ **Pro Tip**: Set `beta1=0` in Adam/Adopt/Prodigy to skip the standard EMA entirely and rely solely on AdEMAMix’s slow EMA, which is ideal for small-batch regimes.
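
A schematic of the dual-EMA step described above for a single tensor (bias correction omitted for brevity; not the library's code):

```python
import torch

beta1, beta2, beta3, alpha, lr, eps = 0.9, 0.999, 0.9999, 5.0, 1e-4, 1e-8

p = torch.randn(100)
m_fast = torch.zeros_like(p)   # standard first moment (skipped entirely if beta1=0)
m_slow = torch.zeros_like(p)   # slow-decaying EMA controlled by beta3
v = torch.zeros_like(p)

grad = torch.randn_like(p)
m_fast = beta1 * m_fast + (1 - beta1) * grad
m_slow = beta3 * m_slow + (1 - beta3) * grad
v = beta2 * v + (1 - beta2) * grad.pow(2)

# alpha weights the long-memory EMA in the final update.
p = p - lr * (m_fast + alpha * m_slow) / (v.sqrt() + eps)
```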

---

### Simplified_AdEMAMix

- Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- Replaces Adam’s first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
- **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.

#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|----------|---------|--------------|
| `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
| `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |

> ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).  
> For `Prodigy_Adv`, set `initial_d` to:
> - **LoRA**: `1e-8`
> - **Full FT**: `1e-10`
> - **Embedding**: `1e-7`

> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
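
A rough sketch of the accumulator idea, under the assumption that the first moment is an unnormalized accumulator (m ← β₁·m + g) mixed with the raw gradient via `Grad α`; see arXiv:2502.02431 for the exact formulation. The accumulator's large magnitude is also why the learning rate must be roughly 100x smaller:

```python
import torch

# Assumed form for illustration only; consult the paper/library for the exact update.
beta1, beta2, grad_alpha, lr, eps = 0.99, 0.999, 100.0, 1e-6, 1e-8

p = torch.randn(100)
acc = torch.zeros_like(p)   # gradient accumulator replacing Adam's first moment
v = torch.zeros_like(p)

grad = torch.randn_like(p)
acc = beta1 * acc + grad                       # accumulator: note, no (1 - beta1) factor
v = beta2 * v + (1 - beta2) * grad.pow(2)

p = p - lr * (grad_alpha * grad + acc) / (v.sqrt() + eps)
```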

#### Performance Validation

**Small Batch Training (SDXL, BS=2, 1.8K steps)**  
![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)

- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4  
- **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6

**Results**:
- Faster convergence and higher final performance with Simplified_AdEMAMix
- D-Adaptation automatically compensates for aggressive updates
- Generated samples show **significantly better quality**

---

### atan2

- Replaces `eps` in Adam-family optimizers with a **scale-invariant**, bounded update rule.
- Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
- **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
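
A minimal sketch of the atan2-style step (after the linked adam-atan2 repository; the constant is illustrative): since the second moment is non-negative, `atan2` stays within (−π/2, π/2), and a scale factor of about 4/π bounds each update entry to roughly [-2, 2].

```python
import torch

a = 1.2732395   # ~4/pi, so a * atan2(., .) lies in roughly (-2, 2)

m = torch.randn(100)            # first moment
v = torch.rand(100)             # second moment (non-negative)
p = torch.randn(100)
lr = 1e-4

update = a * torch.atan2(m, v.sqrt())   # bounded, scale-invariant; no eps needed
p = p - lr * update
```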

---

### **Kourkoutas-β**

**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.

Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:

- **During gradient bursts** → β₂ drops toward the lower β₂ bound → faster reaction  
- **During calm phases** → β₂ rises toward the configured `beta2` → stronger smoothing  

This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
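
A schematic of the per-layer modulation, assuming a bounded sunspike ratio built from the layer's gradient norm and a smoothed reference norm (the exact ratio, bounds, and bookkeeping follow the paper and the `kbeta` code):

```python
import torch

beta2_max = 0.999   # the configured beta2
beta2_min = 0.88    # assumed lower bound, for illustration only

def kourkoutas_beta2(grad: torch.Tensor, running_norm: torch.Tensor):
    """Return an adapted beta2 for this layer plus the updated reference norm."""
    g_norm = grad.norm()
    sunspike = g_norm / (g_norm + running_norm + 1e-12)     # bounded ratio in [0, 1)
    beta2 = beta2_max - sunspike * (beta2_max - beta2_min)  # bursts push beta2 down
    running_norm = 0.9 * running_norm + 0.1 * g_norm        # smoothed reference norm
    return beta2, running_norm

running = torch.tensor(0.0)
for _ in range(3):
    beta2, running = kourkoutas_beta2(torch.randn(256), running)
    print(float(beta2))
```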

#### Pros/Cons

| **Category** | **Details** |
|--------------|-------------|
| ✅ **Pros** | • **Layer-wise adaptation** blends benefits of high β₂ (strong smoothing) and low β₂ (fast reaction).<br>• **Robust to sudden loss landscape shifts**, reacts quickly during gradient bursts, smooths during calm phases.<br>• **High tolerance to aggressive learning rates**. |
| ⚠️ **Cons** | • **Potentially unstable at the start of training** due to unreliable early gradient norms; mitigated by using `K-β Warmup Steps`. |

> 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.

> 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.

#### 📊 Performance Validation

**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**  
<img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />

- 🟣 Fixed `beta2=0.999`  
- 🟠 Auto K-beta  

**Observations:**  
- The adaptive K-β run is more robust and stable at this aggressive learning rate and reaches a better result than the fixed β₂ baseline.

> 📚 **Reference**:  
> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)  
> - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)

---

## Recommended Preset (Tested on LoRA/FT/Embedding)

```yaml
Learning Rate: 1
optimizer: PRODIGY_Adv
settings:
  - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
  - beta2: 0.999
  - kourkoutas_beta: True   # For Kourkoutas-β
  - K-β Warmup Steps: 50    # Or 100, 200, depending on your run
  - Simplified_AdEMAMix: True
  - Grad α: 100
  - OrthoGrad: True
  - weight_decay: 0.0
  - initial_d:          # LoRA: 1e-8, Full fine-tune: 1e-10, Embedding: 1e-7
  - d_coef: 1
  - d_limiter: True     # Stabilizes Prodigy when combined with Simplified_AdEMAMix
  - factored: False     # Can be True or False; quality should not degrade thanks to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
```

> ✅ **Why it works**:  
> - `Kourkoutas-β` adapts `beta2` per layer, so it needs no manual tuning
> - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
> - `OrthoGrad` prevents overfitting without weight decay

---

## 📚 References

1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)  
2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)  
3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)  
4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)  
5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)  
6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)

            
