adv-optm

A family of highly efficient, lightweight yet powerful optimizers.

Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.

🔥 What's New in 1.2.x

  • Added advanced variants of the Muon optimizer, with features and settings drawn from recent papers:

| Optimizer | Description |
| --- | --- |
| Muon_adv | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and related features. |
| AdaMuon_adv | Advanced AdaMuon implementation, combining Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |

Documentation coming soon.

  • Implemented Cautious Weight Decay for all advanced optimizers.

  • Improved parameter updates and weight decay for BF16 with stochastic rounding: updates are now accumulated in float32 and rounded once at the end (see the sketch after this list).

  • Fused and in-place operations are now used wherever possible in all advanced optimizers.

  • Prodigy variants are now 50% faster by avoiding CUDA syncs. Thanks to @dxqb!
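
A minimal sketch of the accumulate-then-round idea behind the BF16 change above (illustrative only, not the library's kernel; the function names are ours):

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Stochastically round an fp32 tensor to bf16.

    bf16 keeps the top 16 bits of the fp32 bit pattern; adding random
    lower-16-bit noise before truncation turns round-toward-zero into
    stochastic rounding, so tiny updates survive in expectation.
    """
    bits = x_fp32.view(torch.int32)
    noise = torch.randint_like(bits, 0, 1 << 16)   # random lower 16 bits
    rounded = (bits + noise) & -65536              # truncate after the carry
    return rounded.view(torch.float32).bfloat16()

@torch.no_grad()
def apply_step(param_bf16, update_fp32, lr, weight_decay):
    """Accumulate the whole step (decay + update) in fp32, round once."""
    acc = param_bf16.float()
    acc.mul_(1.0 - lr * weight_decay)        # decoupled weight decay in fp32
    acc.add_(update_fp32, alpha=-lr)         # parameter update in fp32
    param_bf16.copy_(stochastic_round_to_bf16(acc))
```

Because the random carry is unbiased, small updates that round-to-nearest would always discard are preserved on average, and only one rounding error is introduced per step.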

📦 Installation

pip install adv_optm
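
A minimal usage sketch. `Adam_Adv` and `factored` come from this README; the training loop and the learning rate are illustrative, so check each optimizer's signature for the exact options:

```python
import torch
from adv_optm import Adam_Adv  # or Adopt_Adv, Prodigy_Adv, Lion_Adv, ...

model = torch.nn.Linear(128, 128)

# `factored=True` enables the rank-1 + 1-bit factored optimizer states.
opt = Adam_Adv(model.parameters(), lr=1e-4, factored=True)

for _ in range(10):
    x = torch.randn(4, 128)
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```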

🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:

Memory-Efficient Optimization (SMMF-inspired)

  • Paper: SMMF: Square-Matricized Momentum Factorization
  • Approach: Uses rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor)
  • Innovation:
    • First moment split into 1-bit sign + absolute value
    • Final storage: four factored vectors + one 1-bit sign state
    • Preserves Adam-like update quality with drastically reduced memory
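
For intuition, a rough sketch of that cycle for the non-negative second moment, using a simple Adafactor-style row/column factorization as a stand-in; SMMF's actual square-matricization and the 1-bit sign split of the first moment are more involved:

```python
import torch

def factor(v):
    """Compress a non-negative (n, m) state matrix into two vectors."""
    return v.sum(dim=1), v.sum(dim=0), v.sum()

def reconstruct(row, col, total):
    """Rank-1 approximation of the original matrix from its factors."""
    return torch.outer(row, col) / total.clamp_min(1e-30)

def second_moment_cycle(row, col, total, grad, beta2=0.999):
    """factor -> reconstruct -> update -> factor, for one optimizer step."""
    v = reconstruct(row, col, total)                     # rebuild full matrix
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # EMA of grad**2
    return factor(v)                                     # store only vectors
```

Only the factored vectors persist between steps, which is where the memory saving comes from; the full matrix exists only transiently inside the step.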

⚡ Performance Characteristics

Memory Efficiency (SDXL Model – 6.5GB)

| Optimizer | Memory Usage | Description |
| --- | --- | --- |
| Adopt_Factored | 328 MB | 4 small vectors + one 1-bit state |
| Adopt_Factored + AdEMAMix | 625 MB | 6 small vectors + two 1-bit states |
| Simplified_AdEMAMix | 328 MB | Same as standard factored (no extra state) |

Speed Comparison (SDXL, Batch Size 4)

| Optimizer | Speed | Notes |
| --- | --- | --- |
| Adafactor | ~8.5 s/it | Baseline |
| Adopt_Factored | ~10 s/it | +18% overhead from compression |
| Adopt_Factored + AdEMAMix | ~12 s/it | +41% overhead (3 factored states) |

🧪 Available Optimizers

Standard Optimizers (All support factored=True/False)

| Optimizer | Description | Best For |
| --- | --- | --- |
| Adam_Adv | Advanced Adam implementation | General purpose |
| Adopt_Adv | Adam variant with independent beta2 | Stable training in small-batch regimes |
| Prodigy_Adv | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| Simplified_AdEMAMix | Adam variant with accumulator momentum | Small- or large-batch training when tuned correctly |
| Lion_Adv | Advanced Lion implementation | Memory-constrained environments |
| Prodigy_Lion_Adv | Prodigy + Lion combination | Lion with automatic LR tuning |

⚙️ Feature Matrix

| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
| --- | --- | --- | --- | --- | --- |
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✓ | ✓ | ✓ | ✓ (built in) | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| Kourkoutas-β | ✓ | ✓ | ✓ | ✓ | ✗ |

(✓/✗ as given by the Compatibility columns in the Comprehensive Feature Guide below.)

🛠️ Comprehensive Feature Guide

A. Universal Safe Features

These features work with all optimizers and are generally safe to enable.

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
| --- | --- | --- | --- | --- | --- |
| Fused Backward Pass | Fuses the optimizer step into the backward pass; each gradient is applied immediately and its memory freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| Stochastic Rounding | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | Revisiting BFloat16 Training | All optimizers |
| OrthoGrad | Removes the gradient component parallel to the weights to reduce overfitting (see the sketch after this table) | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | Grokking at the Edge of Numerical Stability | All optimizers |
| Factored | Memory-efficient optimization via rank-1, 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | SMMF | All optimizers |
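
For intuition, the OrthoGrad row above boils down to a per-parameter projection; a minimal sketch (not the library's implementation, and the norm-rescaling detail follows our reading of the paper):

```python
import torch

@torch.no_grad()
def orthograd_(param: torch.Tensor) -> None:
    """Replace param.grad by its component orthogonal to param.

    g <- g - (<g, w> / <w, w>) * w, then rescaled to the original grad norm.
    """
    w = param.reshape(-1).float()
    g = param.grad.reshape(-1).float()
    g_orth = g - (torch.dot(g, w) / (torch.dot(w, w) + 1e-30)) * w
    g_orth *= g.norm() / (g_orth.norm() + 1e-30)
    param.grad.copy_(g_orth.reshape_as(param.grad))
```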

B. Individual Features

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
| --- | --- | --- | --- | --- | --- |
| Cautious | Applies the update only where its direction agrees with the current gradient (see the sketch below) | Accelerating convergence | No overhead | C-Optim | Adam/Adopt/Prodigy/Lion |
| Grams | Update direction derived purely from the current gradient | When Cautious is insufficient | No overhead | Grams | Adam/Adopt/Prodigy |
| AdEMAMix | Dual-EMA system that keeps gradients relevant over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 optimizer state | AdEMAMix | Adam/Adopt/Prodigy |
| Simplified_AdEMAMix | Accumulator-based momentum; single-EMA variant of AdEMAMix | All scenarios, when tuned correctly | No overhead | Connections | Adam/Adopt/Prodigy |
| atan2 | Robust epsilon replacement with built-in update clipping | Stable, bounded updates (especially Adopt, which needs clipping) | No overhead | Adam-atan2 | Adam/Adopt/Prodigy |
| Kourkoutas-β | Layer-wise adaptive β₂ based on a gradient "sunspike" ratio | Noisy, small/large-batch, or high-LR training | No overhead | Kourkoutas-β | Adam/Adopt/Prodigy/Simplified_AdEMAMix |

Note: If both Cautious and Grams are enabled, Grams takes precedence and Cautious is disabled.
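
The Cautious rule is small enough to sketch directly; the mask normalization below follows our reading of the C-Optim reference code and may differ from this library's version:

```python
import torch

def cautious_(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero the update wherever its sign disagrees with the current gradient,
    then rescale the survivors to keep the average update magnitude."""
    mask = (update * grad > 0).to(update.dtype)
    mask /= mask.mean().clamp(min=1e-3)   # keep the overall update scale
    return update * mask
```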

🔍 Feature Deep Dives

AdEMAMix

  • Adds a slow-decaying second EMA (beta3) that retains gradient memory over tens of thousands of steps.
  • Particularly effective for small batch sizes, where Adam’s standard first moment is nearly useless.

Tunable Hyperparameters

| Parameter | Default | Tuning Guide |
| --- | --- | --- |
| beta3 | 0.9999 | Runs >120k steps: 0.9999; runs ≤120k steps: 0.999 |
| alpha | 5 | Reduce to 2–3 if diverging; increase to strengthen long-term memory |

Pro Tip: Set beta1=0 in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix’s slow EMA, ideal for small-batch regimes.
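
For reference, a simplified sketch of the dual-EMA update that beta3 and alpha control (bias correction and the paper's beta3/alpha schedulers are omitted):

```python
import torch

@torch.no_grad()
def ademamix_step(p, g, m1, m2, v, lr=1e-4,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8):
    """One simplified AdEMAMix update with a fast EMA (m1), a slow EMA (m2),
    and an Adam-style second moment (v)."""
    m1.mul_(beta1).add_(g, alpha=1 - beta1)        # fast EMA (usual Adam momentum)
    m2.mul_(beta3).add_(g, alpha=1 - beta3)        # slow EMA: long-range memory
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
    p.add_((m1 + alpha * m2) / (v.sqrt() + eps), alpha=-lr)
```

With beta1=0 the fast EMA reduces to the raw gradient, which is exactly the small-batch configuration suggested in the pro tip above.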

Simplified_AdEMAMix

Tunable Hyperparameters

| Parameter | Default | Tuning Guide |
| --- | --- | --- |
| beta1 | 0.99 | Controls accumulator memory length: small BS 0.99–0.9999; large BS 0.9 |
| Grad α | 100 | Most critical parameter; scales inversely with batch size: 100–10 for small BS (≤32), 1–0.1 for large BS (≥512) |

⚠️ Critical: Requires ~100x smaller learning rate than AdamW (e.g., 1e-6 vs 1e-4). For Prodigy_Adv, set initial_d to:

  • LoRA: 1e-8
  • Full FT: 1e-10
  • Embedding: 1e-7

⚠️ Incompatible with: Cautious, Grams, atan2, and standard update clipping.
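
A hedged configuration sketch reflecting the warnings above; the keyword spellings for beta1, Grad α, and initial_d follow this README and may differ from the released API:

```python
import torch
from adv_optm import Simplified_AdEMAMix  # keyword names below are assumptions

model = torch.nn.Linear(256, 256)

opt = Simplified_AdEMAMix(
    model.parameters(),
    lr=1e-6,        # ~100x smaller than a typical AdamW learning rate
    beta1=0.99,     # accumulator memory length (see the table above)
    alpha=100.0,    # "Grad α": scale it inversely with the batch size
)
# When the Simplified_AdEMAMix feature is used inside Prodigy_Adv, set
# initial_d as listed above (e.g. 1e-10 for full fine-tuning) instead of lr.
```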

atan2

  • Replaces eps in Adam-family optimizers with a scale-invariant, bounded update rule.
  • Automatically clips updates to [-2, 2], preventing destabilizing jumps.
  • Highly recommended for Adopt_Adv, which is prone to instability without clipping.
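
A sketch of the replacement update rule; treat the constants a and b as assumptions taken from the Adam-atan2 formulation:

```python
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                 a: float = 1.2732395, b: float = 1.0) -> torch.Tensor:
    """Scale-invariant replacement for m_hat / (v_hat.sqrt() + eps).

    With a non-negative second argument, atan2 lies in [-pi/2, pi/2], so the
    result is bounded by a * pi/2 (= 2 for the default a) and no eps is needed.
    """
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```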

📚 Reference: Adam-atan2

Kourkoutas-β

Kourkoutas-β introduces a sunspike-driven, layer-wise adaptive second-moment decay (β₂) as an optional enhancement for Adam_Adv, Adopt_Adv, Prodigy_Adv, and Simplified_AdEMAMix.

Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it dynamically modulates β₂ per layer based on a bounded sunspike ratio:

  • During gradient bursts → β₂ drops toward the lower β₂ bound → faster reaction
  • During calm phases → β₂ rises back toward the configured β₂ → stronger smoothing

This is especially effective for noisy training, small batch sizes, and high learning rates, where gradient norms shift abruptly due to noise or aggressive LR schedules.
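
A simplified sketch of the per-layer β₂ modulation; the lower β₂ bound, the norm-EMA decay, and the exact normalization are placeholder values, not necessarily this library's defaults:

```python
def kourkoutas_beta2(grad_norm, norm_ema,
                     beta2_max=0.999, beta2_min=0.88,
                     ema_decay=0.9, eps=1e-8):
    """Return (beta2_t, updated gradient-norm EMA) for one layer at one step."""
    norm_ema = ema_decay * norm_ema + (1 - ema_decay) * grad_norm
    raw = grad_norm / (norm_ema + eps)     # spikes above 1 during a burst
    sunspike = raw / (1.0 + raw)           # bounded to [0, 1)
    return beta2_max - (beta2_max - beta2_min) * sunspike, norm_ema
```

During a burst the sunspike ratio approaches 1 and β₂ falls toward its lower bound; in calm phases it stays near 0 and β₂ sits at the configured maximum.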

Pros/Cons

| Category | Details |
| --- | --- |
| Pros | Layer-wise adaptation blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction). Robust to sudden loss-landscape shifts: reacts quickly during gradient bursts, smooths during calm phases. High tolerance to aggressive learning rates. |
| ⚠️ Cons | Potentially unstable early in training, when gradient norms are not yet reliable; mitigated by using K-β warmup steps. |

💡 Best Practice: Set K_warmup_steps equal to your standard LR warmup steps. During warmup, the optimizer uses the static beta2; adaptation begins only after warmup ends.

📚 Reference: Kourkoutas-β
