On RMS matched Muon
Meta
Since it's been a while since my last post, I'm trying to move toward shorter pieces instead of letting longer, more ambitious drafts keep accumulating. Part of the motivation is also to avoid the failure modes described in this post. Starting with this one, most posts should be modest in scope, while hopefully still being useful and easier to read.
TL;DR
Relying on RMS-matched Muon may lead to either instability/suboptimality or undertraining, depending on model width.
Introduction
As the Muon optimizer (Jordan et al. 2024) became popular over the last year and a half, mainly due to its great performance across tasks and at scale, there has been increasing interest from people wanting to try it on their own setups, which usually have AdamW as the optimizer of choice.
Moonshot AI's paper on scaling Muon thus helpfully provided a way to roughly map people's AdamW hyperparameters to Muon by heuristically matching the update size. This is their Moonlight formulation (Liu et al. 2025; Su 2025a). Additionally, there is a good discussion of different Muon variants in Su 2025c.
In their Moonlight formulation, the update size is matched in an RMS sense, i.e., the desideratum is that for a weight matrix $W \in \mathbb{R}^{n \times m}$ being optimized by Muon, the RMS of the update should be the same whether you use Muon or AdamW.
AdamW's update size, in the steady state, assuming low-SNR/approximately zero-mean gradients and ignoring bias correction and $\epsilon$, is relatively easy to compute (Su 2025b), and the RMS ends up being roughly $\eta\sqrt{\tfrac{1-\beta_1}{1+\beta_1}}$, where $\eta$ is the learning rate and $\beta_1$ is AdamW's first-moment (momentum) coefficient. For the usual $\beta_1 = 0.9$, this becomes $\approx 0.2\,\eta$.
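As a quick sanity check of that number, here is a minimal simulation (not from the original post; the Gaussian gradient noise and the specific $\beta$ values are assumptions) that runs Adam's moment updates on pure noise and compares the empirical update RMS to $\sqrt{\tfrac{1-\beta_1}{1+\beta_1}}$:

```python
import numpy as np

# Simulate Adam's moment estimates on zero-mean, i.i.d. Gaussian "gradients"
# (an assumption standing in for low-SNR gradients); bias correction and eps
# are ignored, as in the text.
rng = np.random.default_rng(0)
beta1, beta2 = 0.9, 0.95
m = np.zeros(10_000)
v = np.zeros(10_000)
for _ in range(5_000):
    g = rng.standard_normal(10_000)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

update = m / np.sqrt(v)   # Adam update direction, with the learning rate factored out
print("empirical update RMS:", np.sqrt(np.mean(update**2)))         # roughly 0.2
print("sqrt((1-b1)/(1+b1)): ", np.sqrt((1 - beta1) / (1 + beta1)))   # ~0.23
```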
Now, since Muon's update is full rank (i.e., rank $\min(n, m)$) and also has all singular values close to $1$ (modulo very small singular values), the squared Frobenius norm of the matrix becomes the sum of squares of the singular values, which is roughly $\min(n, m)$. This means the RMS norm of the matrix is $\sqrt{\min(n, m)/(nm)} = 1/\sqrt{\max(n, m)}$. Hence, to match the RMS, the Moonlight update (with weight decay dropped for simplicity) looks something like $W \leftarrow W - 0.2\,\eta\sqrt{\max(n, m)}\, O$, where $O$ is the Muon semi-orthogonal matrix.
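To see the matching concretely, here is a small sketch (it uses an exact SVD as a stand-in for Muon's Newton-Schulz orthogonalization, and the layer shape is made up) showing that the orthogonalized update has RMS $\approx 1/\sqrt{\max(n, m)}$, so the $0.2\sqrt{\max(n, m)}$ factor brings it back to AdamW's $\approx 0.2$ (i.e., $0.2\,\eta$ once the learning rate is applied):

```python
import numpy as np

def orthogonalize(M):
    # Exact semi-orthogonalization via SVD, as a stand-in for Muon's
    # Newton-Schulz iteration (which only approximates this).
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
n, m = 4096, 1024                                # example layer shape (made up)
O = orthogonalize(rng.standard_normal((n, m)))   # stand-in for the momentum buffer

rms = np.sqrt(np.mean(O**2))
print("RMS of O:          ", rms)                        # ~ 1/sqrt(max(n, m)) = 1/64
print("1/sqrt(max(n, m)): ", 1 / np.sqrt(max(n, m)))
print("RMS after 0.2*sqrt(max(n, m)) scaling:", 0.2 * np.sqrt(max(n, m)) * rms)  # ~ 0.2
```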
What goes wrong
For the main argument, we first need to figure out how Muon's learning rate transfers with width.
Consider the spectral condition for feature learning paper (Yang, Simon, and Bernstein 2023). It says that the RMS-RMS operator norm should be constant for a weight matrix (and its updates) in the hidden layers (i.e., the relevant norm is an operator norm, not necessarily a Frobenius norm). Since the RMS-RMS operator norm of a matrix is $\sqrt{n_{\mathrm{in}}/n_{\mathrm{out}}}$ times its spectral norm, this translates to the spectral norm scaling as $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$ with width, and for a constant aspect ratio, this is a constant (so we will ignore this factor for AdamW too). Since the Muon semi-orthogonal matrix's spectral norm is $\approx 1$, Muon's learning rate should transfer without any dependence on width. This is also observed in practice (Jordan et al. 2024).
However, with AdamW, the stable rank of the update is practically observed to be roughly constant, so its Frobenius norm and spectral norm are only a constant factor apart (Yang, Simon, and Bernstein 2023). Since AdamW's raw update has entries with roughly constant RMS ($\approx 0.2$ as above), its Frobenius norm scales as $\sqrt{nm}$, i.e., with the width for a constant aspect ratio, and so does its spectral norm. This means that the learning rate of Adam should scale as the inverse of the Frobenius norm of the update, i.e., inversely proportional to the width of the layer.
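A toy illustration of this separation (the low-stable-rank construction below is an assumption made for the demo, not a measurement from real training runs): compare the spectral norm of an "Adam-like" update with constant-RMS entries and low stable rank against a "Muon-like" orthogonalized update as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (256, 1024, 2048):
    # "Adam-like" update: entries normalized to RMS ~ 0.2, but with low stable
    # rank (toy construction: a rank-1 direction plus a bit of noise).
    A = np.outer(rng.standard_normal(n), rng.standard_normal(n))
    A += 0.1 * rng.standard_normal((n, n))
    A *= 0.2 / np.sqrt(np.mean(A**2))

    # "Muon-like" update: semi-orthogonalized, spectral norm ~ 1 at any width.
    U, _, Vt = np.linalg.svd(rng.standard_normal((n, n)), full_matrices=False)
    O = U @ Vt

    print(f"width {n}: ||A||_2 = {np.linalg.norm(A, 2):7.1f}, "
          f"||O||_2 = {np.linalg.norm(O, 2):.2f}")
```

The Adam-like update's spectral norm grows roughly linearly with width while the orthogonalized one stays near $1$, which is why the AdamW learning rate has to shrink with width but Muon's does not.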
Let's now do a thought experiment, where we train a series of models with µP scaling for AdamW, and then use their learning rates for RMS-matched Muon (i.e., the Moonlight formulation).
The learning rate in front of the actual semi-orthogonal Muon matrix then scales as $0.2\,\eta_{\mathrm{AdamW}}\sqrt{\max(n, m)} \propto \frac{\sqrt{n}}{n} = \frac{1}{\sqrt{n}}$, where $n$ is the width of the layer, instead of staying constant. This means that larger models might be undertrained, and smaller models might have training instabilities/suboptimality.
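Putting the two scalings together as a back-of-the-envelope check (the base width and base learning rate below are made-up numbers, and square layers are assumed):

```python
import numpy as np

base_width, base_lr = 256, 3e-3       # hypothetical µP base point for AdamW
for n in (256, 1024, 4096, 16384):
    adamw_lr = base_lr * base_width / n           # µP: AdamW lr scales as 1/width
    muon_coeff = 0.2 * adamw_lr * np.sqrt(n)      # Moonlight RMS matching (square layer)
    print(f"width {n:6d}: AdamW lr = {adamw_lr:.2e}, Muon coeff = {muon_coeff:.2e}")
# The coefficient in front of the semi-orthogonal matrix falls off as 1/sqrt(n)
# instead of staying constant across widths.
```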
The primary reason this happens is that there is a separation between the Frobenius and spectral norms for Muon's update (due to its high stable rank), but not for Adam's. This separation also leads to other issues, including effects on weight geometry once weight decay is considered, which we'll discuss in a later post.
So why do people use it anyway? Maybe the width scaling factor is small enough, or Muon's learning rate stability basin (i.e., the logarithmic range of learning rates over which the loss stays within a fixed amount of the minimum) is broad enough, that this isn't visible or doesn't matter.
In any case, it's important to know such caveats when using this formulation with a scaling recipe fitted for AdamW. Using the Keller Jordan variant or the µP variant should give learning rate transfer with width for free, so it's generally preferable to use them instead.
References
- Jordan, Keller, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024. "Muon: An optimizer for hidden layers in neural networks."
- Jordan, Keller, et al. 2024. "Muon." GitHub repository.
- Liu, Jingyuan, Jianlin Su, Xingcheng Yao, et al. 2025. "Muon is Scalable for LLM Training."
- Su, Jianlin. 2025a. "Why Did We Choose to Try Muon?" English translation. Original Chinese: 《Muon续集:为什么我们选择尝试Muon?》
- Su, Jianlin. 2025b. "Why is Adam's Update RMS 0.2?" English translation. Original Chinese: 《为什么Adam的Update RMS是0.2?》
- Su, Jianlin. 2025c. "Muon Optimizer Guide - Quick Start and Key Details." English translation. Original Chinese: 《Muon优化器指南:快速上手与关键细节》
- Yang, Greg, James B. Simon, and Jeremy Bernstein. 2023. "A Spectral Condition for Feature Learning."
Cite this post
@online{rms-matched-muon,
author = {nor},
title = {On RMS matched Muon},
year = {2026},
month = {05},
day = {13},
url = {https://nor-blog.pages.dev/posts/2026-05-13-rms-matched-muon/},
}