On RMS matched Muon

May 13, 2026 · 5 min · 1024 words · nor

Meta: Since it's been a while since my last post, I'm trying to move toward shorter pieces instead of letting longer, more ambitious drafts keep accumulating. Part of the motivation is also to avoid the failure modes described in this post. Starting with this one, most posts should be modest in scope, while hopefully still being useful and easier to read.

TL;DR

Relying on RMS-matched Muon may lead to either instability/suboptimality or undertraining, depending on model width.

Introduction

As the Muon optimizer (Jordan et al. 2024) has become popular over the last year and a half, mainly due to its strong performance across tasks and at scale, there has been increasing interest in trying it on setups where AdamW is the optimizer of choice.

Moonshot AI's paper on scaling Muon helpfully provided a way to roughly map AdamW hyperparameters to Muon by heuristically matching the update size. This is their Moonlight formulation (Liu et al. 2025; Su 2025a, English translation, original Chinese). Additionally, there is a good discussion of different Muon variants in Su 2025c, English translation (original Chinese).

In their Moonlight formulation, the update size is matched in an RMS sense, i.e., the desideratum is that for a matrix in $\mathbb{R}^{m \times n}$ being optimized by Muon, the RMS of the update should be the same whether you use Muon or AdamW.

AdamW's update size in the steady state, assuming low-SNR/approximately zero-mean gradients and ignoring bias correction and $\epsilon$, is relatively easy to compute (Su 2025b, English translation, original Chinese): the RMS ends up being roughly $\eta \sqrt{\frac{1 - \beta_1}{1 + \beta_1}}$, where $\eta$ is the learning rate and $\beta_1$ is AdamW's first-moment decay rate. For the usual $\beta_1 = 0.9$, this is $\approx 0.2\eta$.
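This estimate is easy to check numerically. Below is a minimal sketch that runs AdamW's moment updates to steady state on i.i.d. standard-normal (zero-mean) gradients and compares the empirical update RMS against the formula; the hyperparameter values are illustrative assumptions.

```python
import numpy as np

# Empirical check of AdamW's steady-state update RMS under low-SNR,
# zero-mean gradients (i.i.d. standard normal as a stand-in).
# eta, beta1, beta2 below are illustrative assumptions.
rng = np.random.default_rng(0)
eta, beta1, beta2 = 1.0, 0.9, 0.95
n = 100_000           # number of independent coordinates
m = np.zeros(n)       # first-moment EMA
v = np.zeros(n)       # second-moment EMA
for _ in range(500):  # long enough for the EMAs to reach steady state
    g = rng.standard_normal(n)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
update = eta * m / np.sqrt(v)  # bias correction and eps ignored, as in the text
rms = np.sqrt(np.mean(update**2))
predicted = eta * np.sqrt((1 - beta1) / (1 + beta1))
print(rms, predicted)  # both roughly 0.2 * eta
```

The small residual gap between the two numbers comes from the correlation between $m$ and $v$, which the low-SNR approximation ignores.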

Now, since Muon's update is full rank (i.e., rank $\min(n, m)$) and has all singular values close to $1$ (modulo very small singular values), the squared Frobenius norm of the update is the sum of squares of the singular values, which is roughly $\min(n, m)$. This means the RMS norm of the update is $\frac{1}{\sqrt{\max(n, m)}}$. Hence, to match the RMS, the Moonlight update (with weight decay dropped for simplicity) looks something like $\theta \gets \theta - 0.2\,\eta \sqrt{\max(n, m)}\, U$, where $U$ is Muon's semi-orthogonal matrix.
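A quick sketch of this RMS matching, using an exact SVD-based orthogonalization as a stand-in for Muon's Newton-Schulz iteration (the polar factor is the same up to the very small singular values):

```python
import numpy as np

def orthogonalize(g):
    # Exact polar factor via SVD; Muon approximates this with Newton-Schulz.
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt  # semi-orthogonal: all singular values equal 1

rng = np.random.default_rng(0)
m, n, eta = 1024, 256, 1.0
U = orthogonalize(rng.standard_normal((m, n)))
# ||U||_F^2 = min(m, n), so RMS(U) = 1/sqrt(max(m, n)).
rms_U = np.sqrt(np.mean(U**2))
# Moonlight scaling: the update's RMS now matches AdamW's ~0.2 * eta.
update = 0.2 * eta * np.sqrt(max(m, n)) * U
rms_update = np.sqrt(np.mean(update**2))
print(rms_U, rms_update)
```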

What goes wrong

For the main argument, we first need to work out how Muon's learning rate transfers with width.

Consider the spectral condition for feature learning (Yang, Simon, and Bernstein 2023). It says that the RMS-to-RMS operator norm should be constant for a weight matrix in the hidden layers (i.e., the relevant norm is an operator norm, not necessarily a Frobenius norm). This translates to the spectral norm scaling as $\sqrt{\frac{m}{n}}$ with width, which is constant for a fixed aspect ratio (so we will ignore this factor for AdamW too). Since Muon's semi-orthogonal matrix has spectral norm $\approx 1$, Muon's learning rate should transfer with no dependence on width. This is also observed in practice (Jordan et al. 2024).

With AdamW, however, the stable rank of the update is observed in practice to be roughly constant, so the Frobenius and spectral norms are only a constant factor apart (Yang, Simon, and Bernstein 2023). This means AdamW's learning rate should scale as the inverse of the Frobenius norm of the update, i.e., inversely proportional to the width of the layer.
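The stable-rank gap between the two updates is easy to see numerically. Below, the Adam-like update is a stylized stand-in (a fixed-rank-4 matrix, where 4 is an arbitrary illustrative constant), not a real Adam trajectory; the point is only the contrast with the orthogonalized update.

```python
import numpy as np

def stable_rank(A):
    # stable rank = ||A||_F^2 / ||A||_2^2
    s = np.linalg.svd(A, compute_uv=False)
    return float((s**2).sum() / s[0]**2)

rng = np.random.default_rng(0)
d = 512
# Muon-style update: orthogonalized, all singular values 1 -> stable rank = d.
u, _, vt = np.linalg.svd(rng.standard_normal((d, d)), full_matrices=False)
muon_update = u @ vt
# Stylized Adam-like update with O(1) stable rank (rank 4 is an assumption).
adam_like = rng.standard_normal((d, 4)) @ rng.standard_normal((4, d))
print(stable_rank(muon_update), stable_rank(adam_like))
```

For the Muon-style update, the Frobenius norm exceeds the spectral norm by a factor of $\sqrt{d}$; for the low-stable-rank update, the two norms stay within a constant factor of each other.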

Let's now do a thought experiment: we train a series of models with μP scaling for AdamW, and then use their learning rates for RMS-matched Muon (i.e., the Moonlight formulation).

The coefficient in front of the actual semi-orthogonal Muon matrix then scales as $\frac{1}{\sqrt{d}}$, where $d$ is the width of the layer: the μP AdamW learning rate contributes a $\frac{1}{d}$, and the Moonlight factor a $\sqrt{d}$. Since Muon's optimal learning rate is roughly width-independent, larger models might be undertrained, and smaller models might see training instabilities or suboptimality.
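The arithmetic of the thought experiment, for square $d \times d$ hidden layers (the base learning rate and base width are illustrative assumptions, not tuned values):

```python
import numpy as np

# muP transfer for AdamW: LR ~ 1/width. Plugging that LR into the
# RMS-matched Moonlight update multiplies it by 0.2 * sqrt(d).
# base_lr and base_d are illustrative assumptions.
base_lr, base_d = 1e-3, 256

def moonlight_coeff(d):
    adamw_lr = base_lr * base_d / d      # muP-scaled AdamW learning rate
    return 0.2 * adamw_lr * np.sqrt(d)   # coefficient in front of U ~ 1/sqrt(d)

coeffs = {d: moonlight_coeff(d) for d in (256, 1024, 4096)}
# Each 4x increase in width halves the coefficient on the semi-orthogonal
# update, even though Muon's own optimum is roughly width-independent.
print(coeffs)
```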

The primary reason this happens is the separation between the Frobenius and spectral norms for Muon (due to its high stable rank), which does not exist for Adam. This separation leads to other issues as well, including effects on weight geometry once weight decay is considered, which we'll discuss in a later post.

So why do people use it anyway? Perhaps the width scaling factor is small enough in practice, or Muon's learning rate stability basin (i.e., the logarithmic range of learning rates for which the loss stays within a fixed amount of the minimum) is broad enough that the effect isn't visible or doesn't matter.

In any case, it's important to know these caveats when using this formulation with a scaling recipe fitted for AdamW. Using the Keller Jordan variant or the μP variant gives learning rate transfer with width for free, so it's generally preferable to use one of those instead.

References

Cite this post
@online{rms-matched-muon,
  author    = {nor},
  title     = {On RMS matched Muon},
  year      = {2026},
  month     = {05},
  day       = {13},
  url       = {https://nor-blog.pages.dev/posts/2026-05-13-rms-matched-muon/},
}