On RMS matched Muon

May 13, 2026 · 5 min · 1024 words · nor

Meta: Since it's been a while since my last post, I'm trying to move toward shorter pieces instead of letting longer, more ambitious drafts keep accumulating. Part of the motivation is also to avoid the failure modes described in this post. Starting with this one, most posts should be modest in scope, while hopefully still being useful and easier to read.

TL;DR

Relying on RMS-matched Muon may lead to either instability/suboptimality or undertraining, depending on model width.

Introduction

As the Muon optimizer (Jordan et al. 2024) has become popular over the last year and a half, mainly due to its strong performance across tasks and at scale, there has been increasing interest in trying it on setups where AdamW is the optimizer of choice.

Moonshot AI's paper on scaling Muon helpfully provided a way to roughly map AdamW hyperparameters to Muon by heuristically matching the update size. This is their Moonlight formulation (Liu et al. 2025; Su 2025a, English translation, original Chinese). Additionally, there is a good discussion of different Muon variants in Su 2025c, English translation (original Chinese).

In their Moonlight formulation, the update size is matched in an RMS sense, i.e., the desideratum is that for a matrix in $\mathbb{R}^{m \times n}$ being optimized by Muon, the RMS of the update should be the same whether you use Muon or AdamW.

AdamW's update size in the steady state, assuming low-SNR/approximately zero-mean gradients and ignoring bias correction and $\epsilon$, is relatively easy to compute (Su 2025b, English translation, original Chinese): the RMS ends up being roughly $\eta \sqrt{\frac{1 - \beta_1}{1 + \beta_1}}$, where $\eta$ is the learning rate and $\beta_1$ is AdamW's first-moment decay rate. For the usual $\beta_1 = 0.9$, this is $\approx 0.2\eta$.
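This estimate is easy to check numerically. Below is a minimal sketch that runs AdamW's moment updates to steady state on i.i.d. standard-normal (zero-mean) gradients and compares the empirical update RMS against the formula; the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Empirical check of AdamW's steady-state update RMS under low-SNR,
# zero-mean gradients (i.i.d. standard normal as a stand-in).
# eta, beta1, beta2 below are illustrative assumptions.
rng = np.random.default_rng(0)
eta, beta1, beta2 = 1.0, 0.9, 0.95
n = 100_000           # number of independent coordinates
m = np.zeros(n)       # first-moment EMA
v = np.zeros(n)       # second-moment EMA
for _ in range(500):  # long enough for the EMAs to reach steady state
    g = rng.standard_normal(n)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
update = eta * m / np.sqrt(v)  # bias correction and eps ignored, as in the text
rms = np.sqrt(np.mean(update**2))
predicted = eta * np.sqrt((1 - beta1) / (1 + beta1))
print(rms, predicted)  # both roughly 0.2 * eta
```

The small residual gap between the two numbers comes from the correlation between $m$ and $v$, which the low-SNR approximation ignores.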

Now, since Muon's update is full rank (i.e., rank $\min(n, m)$) and has all singular values close to $1$ (modulo very small singular values), the squared Frobenius norm of the update is the sum of squares of the singular values, which is roughly $\min(n, m)$. This means the RMS norm of the update is $\frac{1}{\sqrt{\max(n, m)}}$. Hence, to match the RMS, the Moonlight update (with weight decay dropped for simplicity) looks something like $\theta \gets \theta - 0.2\,\eta \sqrt{\max(n, m)}\, U$, where $U$ is Muon's semi-orthogonal matrix.
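A quick sketch of this RMS matching, using an exact SVD-based orthogonalization as a stand-in for Muon's Newton-Schulz iteration (the polar factor is the same up to the very small singular values):

```python
import numpy as np

def orthogonalize(g):
    # Exact polar factor via SVD; Muon approximates this with Newton-Schulz.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt  # semi-orthogonal: all singular values equal 1

rng = np.random.default_rng(0)
m, n, eta = 1024, 256, 1.0
U = orthogonalize(rng.standard_normal((m, n)))
# ||U||_F^2 = min(m, n), so RMS(U) = 1/sqrt(max(m, n)).
rms_U = np.sqrt(np.mean(U**2))
# Moonlight scaling: the update's RMS now matches AdamW's ~0.2 * eta.
update = 0.2 * eta * np.sqrt(max(m, n)) * U
rms_update = np.sqrt(np.mean(update**2))
print(rms_U, rms_update)
```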

What goes wrong

For the main argument, we first need to work out how Muon's learning rate transfers with width.

Consider the spectral condition for feature learning (Yang, Simon, and Bernstein 2023). It says that the RMS-to-RMS operator norm should be constant for a weight matrix in the hidden layers (i.e., the relevant norm is an operator norm, not necessarily a Frobenius norm). This translates to the spectral norm scaling as $\sqrt{\frac{m}{n}}$ with width, which is constant for a fixed aspect ratio (so we will ignore this factor for AdamW too). Since Muon's semi-orthogonal matrix has spectral norm $\approx 1$, Muon's learning rate should transfer with no dependence on width. This is also observed in practice (Jordan et al. 2024).

With AdamW, however, the stable rank of the update is observed in practice to be roughly constant, so the Frobenius and spectral norms are only a constant factor apart (Yang, Simon, and Bernstein 2023). This means AdamW's learning rate should scale as the inverse of the Frobenius norm of the update, i.e., inversely proportional to the width of the layer.
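The stable-rank gap between the two updates is easy to see numerically. Below, the Adam-like update is a stylized stand-in (a fixed-rank-4 matrix, where 4 is an arbitrary illustrative constant), not a real Adam trajectory; the point is only the contrast with the orthogonalized update.

```python
import numpy as np

def stable_rank(A):
    # stable rank = ||A||_F^2 / ||A||_2^2
    s = np.linalg.svd(A, compute_uv=False)
    return float((s**2).sum() / s[0]**2)

rng = np.random.default_rng(0)
d = 512
# Muon-style update: orthogonalized, all singular values 1 -> stable rank = d.
u, _, vt = np.linalg.svd(rng.standard_normal((d, d)), full_matrices=False)
muon_update = u @ vt
# Stylized Adam-like update with O(1) stable rank (rank 4 is an assumption).
adam_like = rng.standard_normal((d, 4)) @ rng.standard_normal((4, d))
print(stable_rank(muon_update), stable_rank(adam_like))
```

For the Muon-style update, the Frobenius norm exceeds the spectral norm by a factor of $\sqrt{d}$; for the low-stable-rank update, the two norms stay within a constant factor of each other.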

Let's now do a thought experiment: we train a series of models with μP scaling for AdamW, and then use their learning rates for RMS-matched Muon (i.e., the Moonlight formulation).

The coefficient in front of the actual semi-orthogonal Muon matrix then scales as $\frac{1}{\sqrt{d}}$, where $d$ is the width of the layer: the μP AdamW learning rate contributes a $\frac{1}{d}$, and the Moonlight factor a $\sqrt{d}$. Since Muon's optimal learning rate is roughly width-independent, larger models might be undertrained, and smaller models might see training instabilities or suboptimality.
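The arithmetic of the thought experiment, for square $d \times d$ hidden layers (the base learning rate and base width are illustrative assumptions, not tuned values):

```python
import numpy as np

# muP transfer for AdamW: LR ~ 1/width. Plugging that LR into the
# RMS-matched Moonlight update multiplies it by 0.2 * sqrt(d).
# base_lr and base_d are illustrative assumptions.
base_lr, base_d = 1e-3, 256

def moonlight_coeff(d):
    adamw_lr = base_lr * base_d / d      # muP-scaled AdamW learning rate
    return 0.2 * adamw_lr * np.sqrt(d)   # coefficient in front of U ~ 1/sqrt(d)

coeffs = {d: moonlight_coeff(d) for d in (256, 1024, 4096)}
# Each 4x increase in width halves the coefficient on the semi-orthogonal
# update, even though Muon's own optimum is roughly width-independent.
print(coeffs)
```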

The primary reason this happens is the separation between the Frobenius and spectral norms for Muon (due to its high stable rank), which does not exist for Adam. This separation leads to other issues as well, including effects on weight geometry once weight decay is considered, which we'll discuss in a later post.

So why do people use it anyway? Perhaps the width scaling factor is small enough in practice, or Muon's learning rate stability basin (i.e., the logarithmic range of learning rates for which the loss stays within a fixed amount of the minimum) is broad enough that the effect isn't visible or doesn't matter.

In any case, it's important to know these caveats when using this formulation with a scaling recipe fitted for AdamW. Using the Keller Jordan variant or the μP variant gives learning rate transfer with width for free, so it's generally preferable to use one of those instead.

References

Cite this post
@online{rms-matched-muon,
  author    = {nor},
  title     = {On RMS matched Muon},
  year      = {2026},
  month     = {05},
  day       = {13},
  url       = {https://nor-blog.pages.dev/posts/2026-05-13-rms-matched-muon/},
}