
adv-optm
Advanced tools
A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.
| Optimizer | Description |
|---|---|
| Muon_adv | Advanced Muon implementation with CANS, NorMuon, Low-Rank ortho, and other features. |
| AdaMuon_adv | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
Documentation coming soon.
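For background while the documentation is pending: Muon-style optimizers orthogonalize the 2D momentum matrix before applying it, typically via a Newton-Schulz iteration. The sketch below illustrates that general technique in PyTorch; it is not this library's implementation, and the quintic coefficients are the ones used in the public Muon reference code.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix G (illustrative sketch, not adv_optm's code)."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # quintic coefficients from the public Muon reference
    X = G.bfloat16()
    transposed = G.size(0) > G.size(1)   # iterate on the smaller Gram matrix
    if transposed:
        X = X.T
    X = X / (X.norm() + 1e-7)            # keep the spectral norm <= 1 before iterating
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

# Example: orthogonalized momentum for a weight matrix
momentum = torch.randn(256, 512)
update = newton_schulz_orthogonalize(momentum)
```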
Implemented Cautious Weight Decay for all advanced optimizers.
Improved parameter update and weight decay for BF16 with stochastic rounding. The updates are now accumulated in float32 and rounded once at the end.
Use fused and in-place operations whenever possible for all advanced optimizers.
Prodigy variants are now 50% faster by avoiding CUDA syncs. Thanks to @dxqb!
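To illustrate the BF16 note above (accumulate in float32, round once at the end), here is a minimal sketch of stochastic rounding from a float32 accumulator to BF16. It shows the general technique, not the library's kernel; the function name is illustrative.

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Round float32 -> bfloat16 stochastically, so tiny updates survive on average."""
    # bfloat16 keeps the top 16 bits of a float32; the low 16 bits are normally truncated.
    bits = x_fp32.view(torch.int32)
    # Add uniform noise in [0, 2^16) to the low bits, then truncate: values closer to the
    # upper representable neighbor are more likely to round up.
    noise = torch.randint(0, 1 << 16, bits.shape, dtype=torch.int32, device=bits.device)
    rounded = (bits + noise) & -65536  # clear the low 16 bits
    return rounded.view(torch.float32).bfloat16()

# Usage: accumulate the update in float32, round once at the end.
param_bf16 = torch.randn(4, 4).bfloat16()
update_fp32 = torch.full((4, 4), 1e-4)          # far below bf16 resolution near 1.0
new_fp32 = param_bf16.float() - update_fp32     # float32 accumulation
param_bf16.copy_(stochastic_round_to_bf16(new_fp32))
```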
pip install adv_optm
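A minimal usage sketch, assuming the optimizer classes listed in the tables below (e.g. `Adam_Adv`) are importable from `adv_optm` and follow the standard `torch.optim.Optimizer` interface; the exact constructor arguments may differ until the documentation lands.

```python
import torch
from adv_optm import Adam_Adv  # class name taken from the tables below; import path assumed

model = torch.nn.Linear(512, 512)
optimizer = Adam_Adv(model.parameters(), lr=1e-4)  # keyword options are illustrative only

for step in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```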
This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:
| Optimizer | Memory Usage | Description |
|---|---|---|
| Adopt_Factored | 328 MB | 4 small vectors + 1-bit state |
| Adopt_Factored + AdEMAMix | 625 MB | 6 small vectors + two 1-bit states |
| Simplified_AdEMAMix | 328 MB | Same as standard factored (no extra state) |
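To make these memory numbers concrete, the sketch below shows the general idea behind SMMF-style factored states: a full momentum matrix is stored as two rank-1 factors of its magnitude plus a 1-bit sign tensor, instead of a full-precision matrix. This is a conceptual illustration under those assumptions (function names are mine), not the library's actual compression code.

```python
import torch

def compress(m: torch.Tensor):
    """Store an (R, C) momentum matrix as row/col factors of |m| plus 1-bit signs."""
    mag = m.abs() + 1e-12
    row = mag.mean(dim=1)                  # (R,) rank-1 row factor
    col = mag.mean(dim=0)                  # (C,) rank-1 column factor
    sign = m >= 0                          # 1 bit of information per entry
    return row, col, sign, mag.mean()

def decompress(row, col, sign, mean):
    """Rebuild an approximation: rank-1 magnitude estimate times the stored sign."""
    mag_hat = torch.outer(row, col) / mean  # Adafactor-style rank-1 reconstruction
    return torch.where(sign, mag_hat, -mag_hat)

m = torch.randn(1024, 1024)
row, col, sign, mean = compress(m)
m_hat = decompress(row, col, sign, mean)
# Full state: 1024*1024 floats; factored state: 2*1024 floats + 1024*1024 bits.
```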
| Optimizer | Speed | Notes |
|---|---|---|
| Adafactor | ~8.5s/it | Baseline |
| Adopt_Factored | ~10s/it | +18% overhead from compression |
| Adopt_Factored + AdEMAMix | ~12s/it | +41% overhead (3 factored states) |
Factored states can be toggled per optimizer with `factored=True/False`.
| Optimizer | Description | Best For |
|---|---|---|
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small-batch regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | Memory-constrained environments |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |
| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---|---|---|---|---|---|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kourkoutas-β | ✓ | ✓ | ✓ | ✓ | ✗ |
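One feature checked for every optimizer above, the fused backward pass, is described in the table below; generically, it can be realized in PyTorch (2.1+) by stepping each parameter as soon as its gradient is accumulated, so full-model gradients never need to be held at once. A minimal sketch of that technique, not the library's implementation:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 10))

# One tiny optimizer per parameter, stepped from inside the backward pass.
optimizers = {p: torch.optim.SGD([p], lr=1e-3) for p in model.parameters()}

def step_now(param: torch.Tensor) -> None:
    # Called right after this parameter's gradient has been accumulated.
    optimizers[param].step()
    param.grad = None  # free the gradient immediately to cut peak memory

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_now)

x = torch.randn(32, 512)
loss = model(x).sum()
loss.backward()  # parameters are updated and grads freed during backward
```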
These features work with all optimizers and are generally safe to enable.
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Fused Back Pass | Fuses backward pass; gradients used immediately and memory freed on-the-fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | Grokking at the Edge of Numerical Stability | All optimizers |
| Factored | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
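For the OrthoGrad row above, the underlying operation is a projection of each parameter's gradient onto the subspace orthogonal to the parameter itself, rescaled to keep the original gradient norm. A minimal sketch of that projection (not the library's exact code; the function name is illustrative):

```python
import torch

def orthogonalize_grad(param: torch.Tensor, grad: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Remove the component of `grad` parallel to `param`, preserving the gradient norm."""
    w = param.view(-1)
    g = grad.view(-1)
    proj = torch.dot(w, g) / (torch.dot(w, w) + eps)   # scalar projection coefficient
    g_orth = g - proj * w                              # component orthogonal to the weights
    g_orth = g_orth * (g.norm() / (g_orth.norm() + eps))  # match the raw gradient's magnitude
    return g_orth.view_as(grad)

w = torch.randn(64, 64)
g = torch.randn(64, 64)
g_orth = orthogonalize_grad(w, g)
print(torch.dot(w.view(-1), g_orth.view(-1)))  # ~0: parallel component removed
```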
| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---|---|---|---|---|---|
| Cautious | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust epsilon replacement with built-in gradient clipping | Stable, bounded updates; especially recommended for Adopt, which needs clipping | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
Note: If both Cautious and Grams are enabled, Grams takes precedence and Cautious is disabled.
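As a concrete picture of the note above, here is a minimal sketch of the two transforms, following the C-Optim and Grams papers rather than this library's internals: Cautious zeroes the parts of the update whose sign disagrees with the current gradient, while Grams keeps the update's magnitude but takes its sign entirely from the gradient, which is why Grams subsumes Cautious when both are requested.

```python
import torch

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Apply the update only where it agrees in sign with the current gradient (C-Optim)."""
    mask = (update * grad > 0).to(update.dtype)
    scale = mask.numel() / (mask.sum() + 1)  # rescale so average update magnitude is preserved
    return update * mask * scale

def grams(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Keep the update's magnitude, but take its direction from the current gradient (Grams)."""
    return update.abs() * grad.sign()

u = torch.randn(8, 8)   # e.g. an Adam-style update
g = torch.randn(8, 8)   # current gradient
u_cautious = cautious(u, g)
u_grams = grams(u, g)
```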
AdEMAMix maintains a second, slow EMA (controlled by beta3) that retains gradient memory over tens of thousands of steps.
| Parameter | Default | Tuning Guide |
|---|---|---|
| beta3 | 0.9999 | • Runs >120k steps: 0.9999 • Runs ≤120k steps: 0.999 |
| alpha | 5 | • Reduce to 2–3 if diverging • Increase to strengthen long-term memory |
✅ Pro Tip: Set `beta1=0` in Adam/Adopt/Prodigy to skip the standard EMA entirely and rely solely on AdEMAMix's slow EMA; ideal for small-batch regimes.
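The table and tip above refer to AdEMAMix's two EMAs: a fast one (beta1) and a slow one (beta3), mixed with weight alpha in the numerator. A minimal single-tensor sketch of that update, following the AdEMAMix paper (bias correction and the paper's alpha/beta3 schedulers omitted), not this library's code:

```python
import torch

def ademamix_update(p, g, state, lr=1e-4, beta1=0.9, beta2=0.999, beta3=0.9999,
                    alpha=5.0, eps=1e-8):
    """One AdEMAMix step for a single tensor (sketch following the paper)."""
    m1, m2, v = state["m1"], state["m2"], state["v"]
    m1.mul_(beta1).add_(g, alpha=1 - beta1)        # fast EMA (standard Adam momentum)
    m2.mul_(beta3).add_(g, alpha=1 - beta3)        # slow EMA with very long memory
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
    update = (m1 + alpha * m2) / (v.sqrt() + eps)
    p.add_(update, alpha=-lr)

p = torch.randn(64, 64)
g = torch.randn(64, 64)
state = {"m1": torch.zeros_like(p), "m2": torch.zeros_like(p), "v": torch.zeros_like(p)}
ademamix_update(p, g, state)
```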
Simplified_AdEMAMix parameters:
| Parameter | Default | Tuning Guide |
|---|---|---|
| beta1 | 0.99 | Controls accumulator memory length: • Small BS: 0.99–0.9999 • Large BS: 0.9 |
| Grad α | 100 | Most critical parameter: • Inversely scales with batch size • 100–10 for small BS (≤32) • 1–0.1 for large BS (≥512) |
⚠️ Critical: Requires a ~100x smaller learning rate than AdamW (e.g., 1e-6 vs 1e-4). For Prodigy_Adv, set `initial_d` to:
- LoRA: `1e-8`
- Full FT: `1e-10`
- Embedding: `1e-7`
⚠️ Incompatible with: Cautious, Grams, atan2, and standard update clipping.
atan2 replaces `eps` in Adam-family optimizers with a scale-invariant, bounded update rule. It is particularly useful for Adopt_Adv, which is prone to instability without clipping.
📚 Reference: Adam-atan2
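A minimal sketch of the atan2 form, following the Adam-atan2 paper: the usual `m / (sqrt(v) + eps)` ratio is replaced by `atan2(m, sqrt(v))`, which is bounded, scale-invariant, and needs no epsilon (the paper additionally applies fixed scaling constants, omitted here; this is not the library's exact rule).

```python
import torch

def adam_atan2_direction(m_hat: torch.Tensor, v_hat: torch.Tensor) -> torch.Tensor:
    """Epsilon-free Adam direction: bounded in (-pi/2, pi/2), invariant to gradient scale."""
    return torch.atan2(m_hat, v_hat.sqrt())

# With gradients far below a typical eps=1e-8, m / (sqrt(v) + eps) collapses toward zero,
# while the atan2 form still yields a well-scaled, bounded update.
m_hat = torch.randn(8, 8) * 1e-12
v_hat = (torch.randn(8, 8) * 1e-12) ** 2
print(adam_atan2_direction(m_hat, v_hat).abs().max())
```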
Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.
Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:
- Lower β₂ → faster reaction
- Higher β₂ → stronger smoothing

This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
| Category | Details |
|---|---|
| ✅ Pros | • Layer-wise adaptation blends benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). • Robust to sudden loss landscape shifts, reacts quickly during gradient bursts, smooths during calm phases. • High tolerance to aggressive learning rates. |
| ⚠️ Cons | • Potentially unstable at the start of training due to unreliable early gradient norms; mitigated by using K-β Warmup Steps. |
💡 Best Practice: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
📚 Reference: Kourkoutas-β
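A minimal sketch of the sunspike idea under stated assumptions: per layer, the current gradient norm is compared against a slowly decaying running maximum, and the bounded ratio interpolates β₂ between a low and a high value. The exact formulas, bounds, and defaults follow the Kourkoutas-β paper and may differ in this library; class and parameter names here are illustrative, and the `K_warmup_steps` behavior described above is included.

```python
import torch

class KourkoutasBeta2:
    """Layer-wise adaptive beta2 driven by a bounded 'sunspike' ratio (illustrative sketch)."""

    def __init__(self, beta2_max=0.999, beta2_min=0.9, decay=0.99, k_warmup_steps=500):
        self.beta2_max, self.beta2_min = beta2_max, beta2_min
        self.decay = decay                    # how slowly the tracked max norm forgets bursts
        self.k_warmup_steps = k_warmup_steps  # keep the static beta2 until norms are reliable
        self.max_norm = {}                    # per-layer decayed maximum of gradient norms

    def beta2_for(self, layer_name: str, grad: torch.Tensor, step: int) -> float:
        norm = grad.norm().item()
        prev = self.max_norm.get(layer_name, 0.0)
        self.max_norm[layer_name] = max(norm, self.decay * prev)
        if step <= self.k_warmup_steps:
            return self.beta2_max             # warmup: behave like a fixed beta2
        # Bounded "sunspike" ratio in (0, 1]: ~1 during a gradient burst, ~0 when calm.
        sunspike = norm / (self.max_norm[layer_name] + 1e-12)
        # Bursts pull beta2 toward beta2_min (fast reaction); calm keeps it near beta2_max.
        return self.beta2_max - (self.beta2_max - self.beta2_min) * sunspike

kb = KourkoutasBeta2()
beta2 = kb.beta2_for("layer1.weight", torch.randn(256, 256), step=1000)
```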
FAQs
A family of highly efficient, lightweight yet powerful optimizers.
We found that adv-optm demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.