Advanced Optimizers (AIO)
A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.

🔥 What's New in 1.2.x
- Added advanced variants of the Muon optimizer, with features and settings from recent papers:

  | Optimizer | Description |
  |---|---|
  | Muon_adv | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
  | AdaMuon_adv | Advanced AdaMuon implementation, combining Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |

  Documentation coming soon.
- Implemented Cautious Weight Decay for all advanced optimizers.
- Improved parameter updates and weight decay for BF16 with stochastic rounding: updates are now accumulated in float32 and rounded once at the end.
- Fused and in-place operations are now used wherever possible in all advanced optimizers.
- Prodigy variants are now 50% faster by avoiding CUDA syncs. Thanks to @dxqb!
📦 Installation
pip install adv_optm
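A minimal usage sketch, assuming the optimizers are exposed at the top level of the `adv_optm` package under the names listed below (only `factored` is documented here; check the package documentation for exact signatures):

```python
import torch
from adv_optm import Adam_Adv  # assumed import path

model = torch.nn.Linear(128, 64)

# Illustrative configuration: factored (1-bit compressed) optimizer states for low memory.
optimizer = Adam_Adv(
    model.parameters(),
    lr=1e-4,
    factored=True,  # rank-1 factorized optimizer states (see Core Innovations below)
)

for _ in range(10):
    loss = model(torch.randn(4, 128)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```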
🧠 Core Innovations
This library integrates multiple state-of-the-art optimization techniques, validated through extensive research and practical training, together with 1-bit compression of optimizer states:
Memory-Efficient Optimization (SMMF-inspired)
- Paper: SMMF: Square-Matricized Momentum Factorization
- Approach: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor), sketched below
- Innovation:
  - First moment split into a 1-bit sign plus its absolute value
  - Final storage: four factored vectors + one 1-bit sign state
  - Preserves Adam-like update quality with drastically reduced memory
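To make the factor → reconstruct → update → factor cycle concrete, here is an illustrative sketch of rank-1 factorization with a 1-bit sign state; this is not the library's actual SMMF code, and the normalization details will differ:

```python
import torch

def factor_rank1(m: torch.Tensor):
    """Compress a non-negative 2-D state into two small vectors (rank-1 factorization)."""
    row = m.mean(dim=1)   # one value per row
    col = m.mean(dim=0)   # one value per column
    return row, col

def reconstruct_rank1(row: torch.Tensor, col: torch.Tensor) -> torch.Tensor:
    denom = row.mean().clamp_min(1e-30)       # keeps the outer product on the right scale
    return torch.outer(row, col) / denom      # approximation of the original 2-D state

# First moment: keep only the sign (conceptually 1 bit per element) + factored absolute value.
m1 = torch.randn(4, 3)
sign_state = m1 > 0
row, col = factor_rank1(m1.abs())

# Each step: reconstruct -> apply the update with the new gradient -> re-factor.
m1_approx = reconstruct_rank1(row, col) * torch.where(sign_state, 1.0, -1.0)
```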
⚡ Performance Characteristics
Memory Efficiency (SDXL Model – 6.5GB)
| Configuration | Optimizer State Memory | Stored State |
|---|---|---|
| Adopt_Factored | 328 MB | 4 small vectors + 1-bit state |
| Adopt_Factored + AdEMAMix | 625 MB | 6 small vectors + two 1-bit states |
| Simplified_AdEMAMix | 328 MB | Same as standard factored (no extra state) |
Speed Comparison (SDXL, Batch Size 4)
| Optimizer | Speed | Notes |
|---|---|---|
| Adafactor | ~8.5 s/it | Baseline |
| Adopt_Factored | ~10 s/it | +18% overhead from compression |
| Adopt_Factored + AdEMAMix | ~12 s/it | +41% overhead (3 factored states) |
🧪 Available Optimizers
Standard Optimizers (All support factored=True/False)
| Optimizer | Description | Best For |
|---|---|---|
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small-batch regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small- or large-batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | Memory-constrained environments |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
⚙️ Feature Matrix
| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---|---|---|---|---|---|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kourkoutas-β | ✓ | ✓ | ✓ | ✓ | ✗ |
🛠️ Comprehensive Feature Guide
A. Universal Safe Features
These features work with all optimizers and are generally safe to enable.
| Feature | What It Does | Best For | Overhead | Reference / Notes | Works With |
|---|---|---|---|---|---|
| Fused Backward Pass | Fuses the backward pass; gradients are used immediately and their memory is freed on-the-fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting | Full fine-tuning without weight decay | +33% time at BS=4; less at larger batch sizes | Grokking at the Edge of Numerical Stability | All optimizers |
| Factored | Memory-efficient optimization via rank-1, 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
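For example, stochastic rounding can be sketched as a stand-alone helper (illustrative only, not the library's implementation; it follows the accumulate-in-float32, round-once pattern from the release notes):

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Round a float32 tensor to bfloat16 stochastically.

    BF16 keeps the top 16 bits of an FP32 value. Instead of always rounding to
    nearest, add random noise to the 16 bits about to be truncated, so the value
    rounds up with probability proportional to the discarded fraction.
    """
    bits = x_fp32.view(torch.int32)
    noise = torch.randint_like(bits, low=0, high=1 << 16)  # random lower 16 bits
    rounded = (bits + noise) & ~0xFFFF                      # truncate after the bump
    return rounded.view(torch.float32).bfloat16()

# Usage inside a parameter update (illustrative):
param_bf16 = torch.randn(3, dtype=torch.bfloat16)
update = torch.full((3,), 1e-3)                             # small FP32 update
new_value = param_bf16.float() + update                     # accumulate in float32
param_bf16.copy_(stochastic_round_to_bf16(new_value))       # round once at the end
```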
B. Individual Features
| Feature | What It Does | Best For | Overhead | Reference | Compatible With |
|---|---|---|---|---|---|
| Cautious | Applies the update only where the gradient direction agrees with the momentum direction | Accelerating convergence | None | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from the current gradient | When Cautious is insufficient | None | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual-EMA system that keeps gradients relevant over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state in memory | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum; single-EMA variant of AdEMAMix | All scenarios when tuned correctly | None | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (effectively required for Adopt) | None | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on a gradient "sunspike" ratio | Noisy, small-batch, large-batch, or high-LR training | None | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
Note: If both Cautious and Grams are enabled, Grams takes precedence and Cautious is disabled.
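As a concrete illustration of the Cautious rule above, here is a sketch of the masking step from the C-Optim paper (the library's fused, in-place implementation will differ):

```python
import torch

def cautious_mask(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero out update components whose sign disagrees with the current gradient.

    The mask is rescaled so the overall update magnitude is roughly preserved,
    following the C-Optim formulation.
    """
    mask = (update * grad > 0).to(update.dtype)
    mask = mask * (mask.numel() / (mask.sum() + 1))  # rescale to keep the update's scale
    return update * mask

# Example: a momentum-based update partially opposed to the raw gradient.
update = torch.tensor([0.5, -0.2, 0.1])
grad = torch.tensor([1.0, 0.3, -0.4])
print(cautious_mask(update, grad))  # the second and third components are masked out
```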
🔍 Feature Deep Dives
AdEMAMix
- Adds a slow-decaying second EMA (beta3) that retains gradient memory over tens of thousands of steps.
- Particularly effective for small batch sizes, where Adam’s standard first moment is nearly useless.
Tunable Hyperparameters
| Hyperparameter | Default | Guidance |
|---|---|---|
| beta3 | 0.9999 | Runs >120k steps: 0.9999 • Runs ≤120k steps: 0.999 |
| alpha | 5 | Reduce to 2–3 if diverging • Increase to strengthen long-term memory |
✅ Pro Tip: Set beta1=0 in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix’s slow EMA, ideal for small-batch regimes.
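For reference, a compact sketch of the AdEMAMix step as described in the paper (the alpha/beta3 schedulers and this library's fused, factored details are omitted):

```python
import torch

def ademamix_update(p, g, state, lr=1e-4, beta1=0.9, beta2=0.999,
                    beta3=0.9999, alpha=5.0, eps=1e-8):
    """One AdEMAMix step: fast EMA (beta1), slow EMA (beta3), Adam-style second moment."""
    state["t"] += 1
    t = state["t"]
    m1, m2, v = state["m1"], state["m2"], state["v"]
    m1.mul_(beta1).add_(g, alpha=1 - beta1)        # fast EMA, as in Adam
    m2.mul_(beta3).add_(g, alpha=1 - beta3)        # slow EMA with very long memory
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
    m1_hat = m1 / (1 - beta1 ** t)                 # bias-correct the fast EMA only
    v_hat = v / (1 - beta2 ** t)
    p.add_((m1_hat + alpha * m2) / (v_hat.sqrt() + eps), alpha=-lr)

p = torch.zeros(3)
g = torch.randn(3)
state = {"m1": torch.zeros_like(p), "m2": torch.zeros_like(p),
         "v": torch.zeros_like(p), "t": 0}
ademamix_update(p, g, state)
```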
Simplified_AdEMAMix
Tunable Hyperparameters
| Hyperparameter | Default | Guidance |
|---|---|---|
| beta1 | 0.99 | Controls accumulator memory length • Small BS: 0.99–0.9999 • Large BS: 0.9 |
| Grad α | 100 | Most critical parameter • Inversely scales with batch size • 100–10 for small BS (≤32) • 1–0.1 for large BS (≥512) |
⚠️ Critical: Requires ~100x smaller learning rate than AdamW (e.g., 1e-6 vs 1e-4).
For Prodigy_Adv, set initial_d to:
- LoRA: 1e-8
- Full FT: 1e-10
- Embedding: 1e-7
⚠️ Incompatible with: Cautious, Grams, atan2, and standard update clipping.
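A hedged configuration sketch reflecting the guidance above, assuming top-level imports from `adv_optm`; the keyword names `betas` and `grad_alpha` (standing in for "Grad α") are assumptions and may be spelled differently in the actual API:

```python
import torch
from adv_optm import Simplified_AdEMAMix, Prodigy_Adv  # assumed import path

model = torch.nn.Linear(16, 16)

# Small-batch run: note the ~100x smaller learning rate than typical AdamW values.
opt = Simplified_AdEMAMix(
    model.parameters(),
    lr=1e-6,            # vs. ~1e-4 for AdamW
    betas=(0.99, 0.999),
    grad_alpha=100,     # assumed keyword for "Grad α"; scale inversely with batch size
)

# Prodigy variant with accumulator-style momentum: start d very low (LoRA shown).
opt = Prodigy_Adv(
    model.parameters(),
    initial_d=1e-8,     # 1e-10 for full fine-tuning, 1e-7 for embeddings
)
```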
atan2
- Replaces eps in Adam-family optimizers with a scale-invariant, bounded update rule.
- Automatically clips updates to [-2, 2], preventing destabilizing jumps.
- Highly recommended for Adopt_Adv, which is prone to instability without clipping.
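A sketch of the atan2-style update following the Adam-atan2 formulation; the 4/π factor yields the ±2 bound mentioned above, but treat the exact constants as an assumption about this library:

```python
import math
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor) -> torch.Tensor:
    """Bounded replacement for m_hat / (sqrt(v_hat) + eps).

    atan2 is scale-invariant in its arguments and lies in (-pi/2, pi/2) when
    v_hat >= 0, so scaling by 4/pi bounds every update component to (-2, 2).
    """
    return torch.atan2(m_hat, v_hat.sqrt()) * (4 / math.pi)

# Even a huge momentum over a tiny second moment stays bounded:
print(atan2_update(torch.tensor([1e6, -3.0]), torch.tensor([1e-12, 9.0])))
# ≈ [ 2.0, -1.0]  (second entry: atan2(-3, 3) = -pi/4, then scaled by 4/pi)
```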
Kourkoutas-β
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:
- During gradient bursts → β₂ drops toward its lower bound (lower β₂ → faster reaction)
- During calm phases → β₂ rises toward the selected β₂ (stronger smoothing)
This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
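A sketch of the sunspike-style β₂ modulation described above; this is a simplified reading of the Kourkoutas-β idea, and the variable names, bounds, and exact ratio definition are assumptions rather than the library's code:

```python
import torch

def kourkoutas_beta2(grad: torch.Tensor, state: dict,
                     beta2_max=0.999, beta2_min=0.88, ema_decay=0.9, eps=1e-8):
    """Pick a per-layer beta2 from a bounded 'sunspike' ratio of gradient norms."""
    g_norm = grad.norm()
    state["norm_ema"] = ema_decay * state.get("norm_ema", g_norm) + (1 - ema_decay) * g_norm
    sunspike = g_norm / (state["norm_ema"] + g_norm + eps)  # bounded in [0, 1)
    # The ratio rises toward 1 when the current norm spikes above its running
    # average (burst -> beta2 pushed toward beta2_min, faster reaction) and falls
    # when gradients calm down (beta2 stays closer to beta2_max, stronger smoothing).
    return beta2_max - (beta2_max - beta2_min) * sunspike

state = {}
for step in range(3):
    g = torch.randn(100) * (10.0 if step == 2 else 1.0)  # simulate a gradient burst
    print(float(kourkoutas_beta2(g, state)))
```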
Pros/Cons
✅ Pros:
- Layer-wise adaptation blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction).
- Robust to sudden loss-landscape shifts: reacts quickly during gradient bursts, smooths during calm phases.
- High tolerance to aggressive learning rates.

⚠️ Cons:
- Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β Warmup Steps.
💡 Best Practice: Set K_warmup_steps equal to your standard LR warmup steps. During warmup, the optimizer uses the static beta2; adaptation begins only after warmup ends.