
This repo contains an implementation of the Muon optimizer originally described in this thread and this writeup.
pip install git+https://github.com/KellerJordan/Muon
Muon is intended to optimize only the internal ≥2D parameters of a network. Embeddings, classifier heads, and internal gains/biases should be optimized using AdamW.
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)
from muon import MuonWithAuxAdam
# Find ≥2D parameters in the body of the network -- these should be optimized by Muon
hidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]
# Find everything else -- these should be optimized by AdamW
hidden_gains_biases = [p for p in model.body.parameters() if p.ndim < 2]
exterior_weights = [*model.head.parameters(), *model.embed.parameters()]
# Create the optimizer
# Note: you can also use multiple groups of each type with different hparams if you want.
muon_group = dict(params=hidden_weights, lr=0.02, weight_decay=0.01, use_muon=True)
adam_group = dict(params=hidden_gains_biases+exterior_weights, lr=3e-4,
                  betas=(0.9, 0.95), weight_decay=0.01, use_muon=False)
optimizer = MuonWithAuxAdam([muon_group, adam_group])
You'll have to replace model.body, model.head, and model.embed with whatever subset is appropriate for your model.
E.g., for a ConvNet, Muon should optimize all the convolutional filters except the first one, and AdamW should optimize everything else.
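The ConvNet rule above can be sketched as a small name-based split. This is only an illustration under stated assumptions: the helper `split_params`, the stand-in class `FakeParam`, and the layer names (`conv1`, `conv2`, `bn2`, `fc`) are hypothetical and not part of this repo. The helper only reads each parameter's `.ndim`, so the stand-in lets the sketch run without PyTorch installed.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for torch.nn.Parameter; only .ndim is used."""
    ndim: int


def split_params(named_params, exterior_prefixes):
    """Split (name, param) pairs into (muon_params, adam_params).

    A parameter goes to Muon only if it has >= 2 dimensions and its name
    does not start with any of the exterior prefixes (first layer, head).
    Everything else (gains, biases, exterior weights) goes to AdamW.
    """
    prefixes = tuple(exterior_prefixes)
    muon_params, adam_params = [], []
    for name, p in named_params:
        if p.ndim >= 2 and not name.startswith(prefixes):
            muon_params.append(p)
        else:
            adam_params.append(p)
    return muon_params, adam_params


# Hypothetical ConvNet layout: conv filters are 4D, biases and norm gains 1D.
named = [
    ("conv1.weight", FakeParam(4)),  # first conv: treated as exterior
    ("conv1.bias", FakeParam(1)),
    ("conv2.weight", FakeParam(4)),  # interior conv filter
    ("conv2.bias", FakeParam(1)),
    ("bn2.weight", FakeParam(1)),
    ("fc.weight", FakeParam(2)),     # classifier head: treated as exterior
    ("fc.bias", FakeParam(1)),
]

muon_params, adam_params = split_params(named, exterior_prefixes=("conv1.", "fc."))

# Recover names by identity, since equal-valued stand-ins compare equal.
muon_ids = {id(p) for p in muon_params}
muon_names = [n for n, p in named if id(p) in muon_ids]
print(muon_names)
```

With a real model, you would presumably pass `model.named_parameters()` instead of the hand-written list and use the two returned lists as the `params` of the Muon and AdamW groups shown earlier. Which prefixes count as exterior depends on your own module names.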
Example use in the NanoGPT speedrun
Example use in the CIFAR-10 speedrun
Typically, the default values of momentum (0.95), nesterov (True), and ns_steps (5) work well. The only hyperparameter that must be tuned is the learning rate. It should have constant muP scaling: as you scale up the model size, you shouldn't need to retune it.
For a comparison between AdamW, Shampoo, SOAP, and Muon for training a 124M-parameter transformer, see here.
@misc{jordan2024muon,
author = {Keller Jordan and Yuchen Jin and Vlado Boza and You Jiacheng and
Franz Cesista and Laker Newhouse and Jeremy Bernstein},
title = {Muon: An optimizer for hidden layers in neural networks},
year = {2024},
url = {https://kellerjordan.github.io/posts/muon/}
}