Theoretical properties of optimizers on a toy problem, and some intuition

August 2, 2025 · 48 min · 10137 words · nor

Introduction

We want to answer the question: how do optimizers fundamentally differ in their approach to finding a minimum? To explore this question in a controlled environment, we focus on a simple problem: minimizing the matrix error term minAXB\min \lVert AX - B \rVert.

This problem serves as a toy task for theoretical exploration because it is tractable, in part due to its simplicity. By assuming the matrix AA is symmetric and positive definite (SPD), we guarantee a unique, known solution X*=A1BX^* = A^{-1}B, allowing us to measure things quite precisely. The problem is convex, which removes the complexities of local minima and lets us focus purely on the dynamics of convergence --- we make this compromise because non-convexity makes theoretical analysis significantly harder, but in no way does this mean that non-convex problems are not important; most interesting optimization problems are non-convex. Furthermore, by allowing the error AXB\lVert AX - B \rVert to be measured by a wide range of norms (both Schatten and element-wise), we can study how the problem's "shape" influences optimizer performance.

Our analysis centers on a family of what we will informally call "geometric optimizers". These are algorithms whose update direction is determined by normalizing the gradient in a manner according to a chosen matrix norm. This family includes well-known methods like Normalized Gradient Descent (which uses a standard Euclidean "geometry"), the Muon optimizer (which uses a spectral "geometry"), and SignSGD (which uses an element-wise "geometry" and is the special case of Adam without momentum).

The good news is that, for this problem, any optimizer in this family is guaranteed to find the exact solution to this toy problem in a finite amount of time. This stands in sharp contrast to standard gradient descent, which only approaches the solution asymptotically as time goes to infinity. We show that the total convergence time is bounded and is determined by the relationship between the optimizer's chosen update norm and the problem's loss norm. In particular, we will see how Muon's theoretical convergence guarantees are nicer than Adam's.

This post is structured to build this from the ground up. We will:

  • Formally define the optimization problem and the class of geometric optimizers.
  • Present a general theorem that provides upper and lower bounds on the convergence time.
  • Apply this theorem to Normalized Gradient Descent, Muon, and Adam/SignSGD to derive concrete, comparable performance bounds.
  • Using these results and otherwise, discuss the different motivations behind these optimizers and build some intuition about their behavior.

It is important to state at the outset that our analysis is set in an idealized continuous-time framework, which has a lot of simplifications at the very outset (no discrete step size pathologies, no noise, very simple loss landscape). The primary goal is to provide a mathematical lens for understanding why and how different optimizers behave differently. The intuition developed here might serve as a valuable foundation for thinking about their discrete-time and stochastic counterparts used in practice. Be warned, though --- intuition built from toy problems can only go so far.

The Setting

The Optimization Problem

We investigate the following convex optimization problem:

minXn×mL(X)AXBα \min_{X \in \mathbb{R}^{n \times m}} L(X) \coloneqq \lVert AX - B \rVert_{\alpha}

Here, the matrix An×nA \in \mathbb{R}^{n \times n} is assumed to be symmetric and positive definite (SPD). This is an important assumption, implying that AA is invertible and all its eigenvalues are strictly positive. The matrices Xn×mX \in \mathbb{R}^{n \times m} and Bn×mB \in \mathbb{R}^{n \times m} represent the variable and the target, respectively. The objective function L(X)L(X) measures the "size" of the residual matrix AXBAX - B using a chosen matrix norm, α\lVert \cdot \rVert_{\alpha}. Our analysis covers two primary families of norms: Schatten-p norms and element-wise p-norms.

Since AA is invertible, a unique solution X*X^* exists that makes the residual zero: AX*B=0AX^* - B = 0. This solution is X*=A1BX^* = A^{-1}B. At this point, the loss function attains its global minimum value of L(X*)=A(A1B)Bα=0α=0L(X^*) = \lVert A(A^{-1}B) - B \rVert_{\alpha} = \lVert 0 \rVert_{\alpha} = 0. The goal of an optimizer is to find this unique solution X*X^*.

Definition 2.1. (Schatten-p norm) The Schatten-p norm, denoted Sp\lVert \cdot \rVert_{S_p}, is a norm on the space of matrices, defined via their singular values. For any matrix Zn×mZ \in \mathbb{R}^{n \times m}, let its singular values be σ1(Z)σ2(Z)σr(Z)>0\sigma_1(Z) \ge \sigma_2(Z) \ge \dots \ge \sigma_r(Z) > 0, where r=rank(Z)r = \text{rank}(Z). The Schatten-p norm for p[1,]p \in [1, \infty] is defined as:

ZSp(i=1rσi(Z)p)1/p \lVert Z \rVert_{S_p} \coloneqq \left( \sum_{i=1}^r \sigma_i(Z)^p \right)^{1/p}

This is equivalent to the LpL_p norm of the vector of singular values. Notable cases include the Nuclear norm (p=1p=1, denoted *\lVert \cdot \rVert_*), the Frobenius norm (p=2p=2, denoted F\lVert \cdot \rVert_F), and the Spectral norm (p=p=\infty). The dual (we will explain this in a later section) of the Schatten-p norm is the Schatten-q norm, where pp and qq are Hölder conjugates, satisfying 1/p+1/q=11/p + 1/q = 1.

Definition 2.2. (Element-wise p-norm) The element-wise p-norm, denoted e,p\lVert \cdot \rVert_{e,p}, is a norm on the space of matrices defined by treating the matrix as a vector in mn\mathbb{R}^{mn}. For any matrix Zn×mZ \in \mathbb{R}^{n \times m}, the element-wise p-norm for p[1,]p \in [1, \infty] is defined as:

Ze,p(i=1nj=1m|Zij|p)1/p \lVert Z \rVert_{e,p} \coloneqq \left( \sum_{i=1}^n \sum_{j=1}^m |Z_{ij}|^p \right)^{1/p}

This is the standard LpL_p norm of the vectorized matrix. The case p=2p=2 corresponds to the Frobenius norm, Ze,2=ZF\lVert Z \rVert_{e,2} = \lVert Z \rVert_F. The dual of the element-wise p-norm is the element-wise q-norm, where 1/p+1/q=11/p + 1/q = 1.

Subgradient of the Loss Function

To minimize L(X)L(X), we need its gradient. However, for p2p \neq 2, neither the Schatten nor element-wise p-norms are differentiable everywhere. We therefore use the more general concept of the subgradient.

The subgradient of a convex function ff at a point xx is a vector gg such that f(y)f(x)+gT(yx)f(y) \ge f(x) + g^T(y-x) for all yy. The set of all such vectors is called the subdifferential, denoted f(x)\partial f(x). For our loss function L(X)=h(g(X))L(X) = h(g(X)) where h(Z)=Zαh(Z) = \lVert Z \rVert_{\alpha} and g(X)=AXBg(X) = AX - B, the chain rule for subdifferentials states that XL(X)=ATZh(AXB)\partial_X L(X) = A^T \partial_Z h(AX-B).

Let δ\lVert \cdot \rVert_\delta be the dual norm to the loss norm α\lVert \cdot \rVert_\alpha. (If α\alpha is SpS_p, δ\delta is SqS_q; if α\alpha is e,pe,p, δ\delta is e,qe,q). Almost by definition of the dual norm (which we give later), the subdifferential of the norm α\lVert \cdot \rVert_\alpha at a matrix Z0Z \neq 0 is given by where (S,ZF=Tr(STZ)\langle S, Z \rangle_F = \text{Tr}(S^T Z) is the standard Frobenius inner product):

ZZα={Sn×mSδ=1 and S,ZF=Zα} \partial_Z \lVert Z \rVert_{\alpha} = \{ S \in \mathbb{R}^{n \times m} \mid \lVert S \rVert_{\delta} = 1 \text{ and } \langle S, Z \rangle_F = \lVert Z \rVert_{\alpha} \}

Consequently, the subdifferential of our loss function L(X)L(X) (for XX*X \neq X^*) is: XL(X)={ATSSn×m,Sδ=1,S,AXBF=AXBα} \partial_X L(X) = \{ A^T S \mid S \in \mathbb{R}^{n \times m}, \lVert S \rVert_{\delta} = 1, \langle S, AX-B \rangle_F = \lVert AX-B \rVert_{\alpha} \} Since our matrix AA is symmetric, AT=AA^T = A. For the purpose of analyzing a continuous-time gradient flow, we can consider an optimizer that, at each time tt, selects one specific subgradient GtXL(Xt)G_t \in \partial_X L(X_t) to determine its direction.

The Family of Geometric Optimizers

We define a family of optimizers based on the principle of continuous steepest descent. The direction of steepest descent depends on the geometry of the underlying space, which we define by choosing a norm, β\lVert \cdot \rVert_\beta, on the space of matrices n×m\mathbb{R}^{n \times m}.

Definition 2.3. (Dual norm) For any norm β\lVert \cdot \rVert_\beta on a vector space, its dual norm, denoted γ\lVert \cdot \rVert_\gamma, is defined for a matrix GG as:

GγsupZ:Zβ1G,ZF \lVert G \rVert_\gamma \coloneqq \sup_{Z : \lVert Z \rVert_\beta \le 1} \langle G, Z \rangle_F

where G,ZF=Tr(GTZ)\langle G, Z \rangle_F = \text{Tr}(G^T Z) is the standard Frobenius inner product. The dual norm measures the maximum projection of GG onto the unit ball of the primal norm β\lVert \cdot \rVert_\beta.

It is left to the curious reader to verify that such a mapping is indeed a norm, and all the properties of dual norms mentioned in the previous section.

Definition 2.4. (Dualizer) The dualizer for a norm β\lVert \cdot \rVert_\beta is a map, dualizerβ()\text{dualizer}_{\lVert \cdot \rVert_\beta}(\cdot), which for any input matrix G0G \neq 0, returns the direction ZZ that achieves the supremum in the dual norm definition. That is, dualizerβ(G)\text{dualizer}_{\lVert \cdot \rVert_\beta}(G) is a matrix ZZ such that:

Zβ=1andG,ZF=Gγ \lVert Z \rVert_\beta=1 \quad \text{and} \quad \langle G, Z \rangle_F = \lVert G \rVert_\gamma

The dualizer identifies the direction in the unit ball of β\lVert \cdot \rVert_\beta that is most aligned with GG.

Definition 2.5. (Continuous-time gradient flow for an optimizer) The continuous-time gradient flow for an optimizer is defined by moving in the direction of steepest descent, which is precisely the negative of the dualized gradient. For a chosen subgradient GtXL(Xt)G_t \in \partial_X L(X_t), the dynamics are:

dXtdt=ηdualizerβ(Gt) \frac{dX_t}{dt} = -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t)

where η>0\eta > 0 is a constant learning rate that scales the step size.

A Convergence Theorem

Theorem 3.1. For the optimization problem with a loss norm α\lVert \cdot \rVert_\alpha that is either a Schatten-p norm (p[1,]p \in [1, \infty]) or an element-wise p-norm (p[1,]p \in [1, \infty]), any optimizer defined by the continuous-time gradient flow under a chosen "geometry"-inducing norm β\lVert \cdot \rVert_\beta converges to the unique global minimum X*X^* in a finite amount of time, TfinalT_{\text{final}}. This time is bounded as follows:

L(X0)ηCupperTfinalL(X0)ηClower \frac{L(X_0)}{\eta \cdot C_{\text{upper}}} \le T_{\text{final}} \le \frac{L(X_0)}{\eta \cdot C_{\text{lower}}}

where L(X0)=AX0BαL(X_0) = \lVert AX_0 - B \rVert_{\alpha} is the initial loss, γ\lVert \cdot \rVert_\gamma is the dual norm to the optimizer's norm β\lVert \cdot \rVert_\beta, and δ\lVert \cdot \rVert_\delta is the dual norm to the loss norm α\lVert \cdot \rVert_\alpha. The constants ClowerC_{\text{lower}} and CupperC_{\text{upper}} are defined by the interaction of the matrix AA with the chosen geometries:

ClowerinfSn×m:Sδ=1ASγandCuppersupSn×m:Sδ=1ASγ C_{\text{lower}} \coloneqq \inf_{S \in \mathbb{R}^{n \times m} : \lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \quad \text{and} \quad C_{\text{upper}} \coloneqq \sup_{S \in \mathbb{R}^{n \times m} : \lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma

(Since AA is SPD, we use AA in place of its transpose ATA^T.)

Proof. The proof proceeds in three steps: defining a potential function and its time derivative, bounding this derivative, and integrating the resulting differential inequality to find the time to convergence.

The Potential Function and its Time Derivative. We choose the objective function itself as a Lyapunov potential function: V(t)L(Xt)=AXtBαV(t) \coloneqq L(X_t) = \lVert AX_t - B \rVert_{\alpha}. Its time derivative is the directional derivative in the direction of the flow, given by the inner product of a subgradient GtG_t with the velocity vector dXtdt\frac{dX_t}{dt}.

dVdt=Gt,dXtdtF=Gt,ηdualizerβ(Gt)F=ηGt,dualizerβ(Gt)F \frac{dV}{dt} = \left\langle G_t, \frac{dX_t}{dt} \right\rangle_F = \left\langle G_t, -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F = -\eta \left\langle G_t, \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F

By the definition of the dualizer, Gt,dualizerβ(Gt)F=Gtγ\left\langle G_t, \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F = \lVert G_t \rVert_\gamma. Thus, we have a simple expression for the rate of change:

dVdt=ηGtγ \frac{dV}{dt} = -\eta \lVert G_t \rVert_\gamma

This shows that the potential function is non-increasing as long as the gradient is non-zero.

Bounding the Rate of Change. From the equation on subgradients, any subgradient GtG_t can be expressed as Gt=AStG_t = A S_t for some matrix Stn×mS_t \in \mathbb{R}^{n \times m} which satisfies Stδ=1\lVert S_t \rVert_{\delta} = 1. Substituting this into our expression for the derivative gives:

dVdt=ηAStγ,where the matrix St satisfies Stδ=1. \frac{dV}{dt} = -\eta \lVert A S_t \rVert_\gamma, \quad \text{where the matrix } S_t \text{ satisfies } \lVert S_t \rVert_{\delta} = 1.

The rate of decay dVdt\frac{dV}{dt} depends on the matrix StS_t, which is always constrained to lie on the unit sphere of the norm δ\lVert \cdot \rVert_\delta. Therefore, the rate of change must lie between fixed bounds determined by the minimum and maximum possible values of ASγ\lVert A S \rVert_\gamma over this sphere.

ηsupS:Sδ=1ASγdVdtηinfS:Sδ=1ASγ -\eta \sup_{S:\lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \le \frac{dV}{dt} \le -\eta \inf_{S:\lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma

Using the definitions of ClowerC_{\text{lower}} and CupperC_{\text{upper}} from the theorem statement, we obtain the inequality:

ηCupperdVdtηClower -\eta C_{\text{upper}} \le \frac{dV}{dt} \le -\eta C_{\text{lower}}

Since AA is SPD, its null space is trivial, so for any non-zero SS, ASAS is non-zero. Since norms are positive for non-zero inputs, ClowerC_{\text{lower}} is strictly positive.

Integration and Finite-Time Convergence. The differential inequality dVdtηClower\frac{dV}{dt} \le -\eta C_{\text{lower}} guarantees that V(t)V(t) decreases at a rate of at least ηClower\eta C_{\text{lower}}. To find an upper bound on the time to convergence, we integrate this from t=0t=0 to TfinalT_{\text{final}}:

V(0)V(Tfinal)dV0TfinalηClowerdt \int_{V(0)}^{V(T_{\text{final}})} dV \le \int_{0}^{T_{\text{final}}} -\eta C_{\text{lower}} \, dt

Convergence occurs when V(Tfinal)=0V(T_{\text{final}}) = 0. The initial value is V(0)=L(X0)V(0) = L(X_0).

0L(X0)ηClowerTfinalTfinalL(X0)ηClower 0 - L(X_0) \le -\eta C_{\text{lower}} \cdot T_{\text{final}} \implies T_{\text{final}} \le \frac{L(X_0)}{\eta C_{\text{lower}}}

This provides the upper bound on the convergence time. The lower bound is found by integrating the other side, dVdtηCupper\frac{dV}{dt} \ge -\eta C_{\text{upper}}:

L(X0)ηCupperTfinalTfinalL(X0)ηCupper -L(X_0) \ge -\eta C_{\text{upper}} \cdot T_{\text{final}} \implies T_{\text{final}} \ge \frac{L(X_0)}{\eta C_{\text{upper}}}

This establishes both bounds and completes the proof. \blacksquare

Convergence Bounds for Specific Scenarios

We now instantiate the general result of Theorem 3.1 for three specific optimizers. The idea is to derive bounds for ClowerC_{\text{lower}} and CupperC_{\text{upper}}.

Remark 4.1. Bounding CC

Our approach depends on the types of norms involved.

  • Schatten-Schatten Norms: When both the optimizer's dual norm (γ\gamma) and the loss's dual norm (δ\delta) are Schatten norms, we can use a direct and powerful inequality: for any Schatten norm Sr\lVert\cdot\rVert_{S_r}, λmin(A)SSrASSrλmax(A)SSr\lambda_{\min}(A) \lVert S \rVert_{S_r} \le \lVert AS \rVert_{S_r} \le \lambda_{\max}(A) \lVert S \rVert_{S_r}. This allows for a very tight analysis.
  • Mixed or Element-wise Norms: When at least one of the norms (γ\gamma or δ\delta) is an element-wise norm, the direct inequality above does not hold. In these cases, we use the Frobenius norm (F\lVert \cdot \rVert_F) as a bridge (for convenience; there are stronger results for certain cases), as it is both a Schatten and an element-wise norm. We relate the norms γ\gamma and δ\delta to the Frobenius norm and use the fact that λmin(A)SFASFλmax(A)SF\lambda_{\min}(A) \lVert S \rVert_F \le \lVert AS \rVert_F \le \lambda_{\max}(A) \lVert S \rVert_F.

There are more powerful ways of bounding these norms, but we will refrain from doing so for the sake of simplicity.

The following lemmas on norm equivalence, which are direct consequences of the generalized mean inequality, will be useful.

Lemma 4.2. (Schatten norm equivalence) Let Sn×mS \in \mathbb{R}^{n \times m} be any matrix with k=min(n,m)k = \min(n, m). For any 1ab1 \le a \le b \le \infty, the Schatten norms of SS are related by:

SSbSSak1a1bSSb \lVert S \rVert_{S_b} \le \lVert S \rVert_{S_a} \le k^{\frac{1}{a}-\frac{1}{b}} \lVert S \rVert_{S_b}

Lemma 4.3. (Element-wise norm equivalence) Let Sn×mS \in \mathbb{R}^{n \times m}. For any 1ab1 \le a \le b \le \infty, the element-wise norms of SS are related by:

Se,bSe,a(mn)1a1bSe,b \lVert S \rVert_{e,b} \le \lVert S \rVert_{e,a} \le (mn)^{\frac{1}{a}-\frac{1}{b}} \lVert S \rVert_{e,b}

Normalized Gradient Descent

We will consider the variant of Normalized Gradient Descent (NGD) which divides the gradient throughout by its Frobenius norm. That is,

dXtdt=ηGtGtF \frac{dX_t}{dt} = -\eta \frac{G_t}{\lVert G_t \rVert _F}

Let's try to identify the geometry of the optimizer. The dual norm is Gtγ=Gt,GtGtFF=GtF\lVert G_t \rVert _\gamma = \langle G_t, \frac{G_t}{\lVert G_t \rVert _F} \rangle_F = \lVert G_t \rVert _F. Since the Frobenius norm is self-dual, the optimizer's defining norm is also Frobenius.

Case 1: Schatten-p Norm Loss

Here α=Sp\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p} and its dual is δ=Sq\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}. Since both γ\gamma and δ\delta are Schatten norms, we use the direct analysis.

We need to bound Clower/upper=inf/supSSq=1ASFC_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_F. Using the direct bound: λmin(A)SFASFλmax(A)SF\lambda_{\min}(A)\lVert S \rVert_F \le \lVert AS \rVert_F \le \lambda_{\max}(A)\lVert S \rVert_F. This gives Clowerλmin(A)infSSq=1SFC_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F and Cupperλmax(A)supSSq=1SFC_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F. We use Lemma 4.2 to find the inf/sup of SF=SS2\lVert S \rVert_F = \lVert S \rVert_{S_2} given SSq=1\lVert S \rVert_{S_q}=1.

If p2p \ge 2 (so 1q21 \le q \le 2): We compare norms SqS_q and S2S_2. Here a=q,b=2a=q, b=2. k121qSSqSS2SSqk^{\frac{1}{2}-\frac{1}{q}} \lVert S \rVert_{S_q} \le \lVert S \rVert_{S_2} \le \lVert S \rVert_{S_q}. With SSq=1\lVert S \rVert_{S_q}=1, we have infSF=k121q\inf \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}} and supSF=1\sup \lVert S \rVert_F = 1. Clowerλmin(A)k121qC_{\text{lower}} \ge \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}} and Cupperλmax(A)C_{\text{upper}} \le \lambda_{\max}(A).

If p2p \le 2 (so q2q \ge 2): We compare norms S2S_2 and SqS_q. Here a=2,b=qa=2, b=q. SSqSS2k121qSSq\lVert S \rVert_{S_q} \le \lVert S \rVert_{S_2} \le k^{\frac{1}{2}-\frac{1}{q}} \lVert S \rVert_{S_q}. With SSq=1\lVert S \rVert_{S_q}=1, we have infSF=1\inf \lVert S \rVert_F = 1 and supSF=k121q\sup \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}}. Clowerλmin(A)C_{\text{lower}} \ge \lambda_{\min}(A) and Cupperλmax(A)k121qC_{\text{upper}} \le \lambda_{\max}(A) k^{\frac{1}{2}-\frac{1}{q}}.

Corollary 4.4. For NGD with Schatten-pp loss:

  • If p2p \ge 2 (1q21 \le q \le 2):

    AX0BSpηλmax(A)TfinalAX0BSpηλmin(A)k121q \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}}}

  • If p2p \le 2 (q2q \ge 2):

    AX0BSpηλmax(A)k121qTfinalAX0BSpηλmin(A) \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A) k^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)}

Case 2: Element-wise p-Norm Loss

The proof is perfectly analogous to the Schatten case, but since the loss norm is element-wise, we will use the Frobenius-bridging method. The bounds for SF\lVert S \rVert_F are found using Lemma 4.3, which replaces kk with mnmn.

Corollary 4.5. For NGD with element-wise pp-norm loss:

  • If p2p \ge 2 (1q21 \le q \le 2):

    AX0Be,pηλmax(A)TfinalAX0Be,pηλmin(A)(mn)121q \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}}

  • If p2p \le 2 (q2q \ge 2):

    AX0Be,pηλmax(A)(mn)121qTfinalAX0Be,pηλmin(A) \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\max}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)}

Muon

Note that Muon orthogonalizes the gradient, i.e., if Gt=UtΣtVtTG_t = U_t \Sigma_t V_t^T, then dXtdt=ηUtVtT\frac{dX_t}{dt} = -\eta \cdot U_t V_t^T. Let's try to see whether this fits into our definition. The dual norm would be Gtγ=UtΣtVtT,UtVtTF=Tr(VtUtTUtΣtVtT)=Tr(VtΣtVtT)=Tr(Σt)=Gt*\lVert G_t \rVert _\gamma = \langle U_t \Sigma_t V_t^T, U_t V_t^T \rangle_F = \text{Tr}(V_t U_t^T U_t \Sigma_t V_t^T) = \text{Tr}(V_t \Sigma_t V_t^T) = \text{Tr}(\Sigma_t) = \lVert G_t \rVert _*. So the dual norm is the Nuclear norm (S1S_1). Hence the defining norm is the spectral norm (SS_\infty).

Case 1: Schatten-p Norm Loss

Here, the loss norm is α=Sp\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p} and its dual is δ=Sq\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}. Both the optimizer's dual norm (γ=S1\gamma=S_1) and the loss's dual norm (δ=Sq\delta=S_q) are Schatten norms, so we use the direct analysis for a tighter bound.

We need to bound Clower/upper=inf/supSSq=1ASS1C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_{S_1}. The direct method gives λmin(A)SS1ASS1λmax(A)SS1\lambda_{\min}(A)\lVert S \rVert_{S_1} \le \lVert AS \rVert_{S_1} \le \lambda_{\max}(A)\lVert S \rVert_{S_1}. Thus, Clowerλmin(A)infSSq=1SS1C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_{S_1} and Cupperλmax(A)supSSq=1SS1C_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_{S_1}. We use Lemma 4.2 to compare the S1S_1 and SqS_q norms. Since q1q \ge 1, we always have a=1,b=qa=1, b=q. The inequality is SSqSS1k11qSSq\lVert S \rVert_{S_q} \le \lVert S \rVert_{S_1} \le k^{1-\frac{1}{q}} \lVert S \rVert_{S_q}. Given SSq=1\lVert S \rVert_{S_q} = 1, we get 1SS1k11q1 \le \lVert S \rVert_{S_1} \le k^{1-\frac{1}{q}}. Therefore, infSS1=1\inf \lVert S \rVert_{S_1} = 1 and supSS1=k11q\sup \lVert S \rVert_{S_1} = k^{1-\frac{1}{q}}. This yields the bounds: Clowerλmin(A)C_{\text{lower}} \ge \lambda_{\min}(A) and Cupperλmax(A)k11qC_{\text{upper}} \le \lambda_{\max}(A) k^{1-\frac{1}{q}}. Noting that 11/q=1/p1-1/q = 1/p, the factor can be written as k1/pk^{1/p}.

Corollary 4.6. For Muon with Schatten-pp loss, for any p[1,]p \in [1, \infty], the convergence time is bounded by:

AX0BSpηλmax(A)k1/pTfinalAX0BSpηλmin(A) \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)}

Case 2: Element-wise p-Norm Loss

Here α=e,p\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{e,p}, so δ=e,q\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{e,q}. The optimizer norm is γ=S1\gamma=S_1. This is a mixed-norm case, so we must use the Frobenius bridge.

We bound Clower/upper=inf/supSe,q=1ASS1C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{e,q}=1} \lVert AS \rVert_{S_1}. We use the inequalities ZFZS1kZF\lVert Z \rVert_F \le \lVert Z \rVert_{S_1} \le \sqrt{k} \lVert Z \rVert_F. ClowerinfASFλmin(A)infSFC_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf \lVert S \rVert_F. CuppersupkASFkλmax(A)supSFC_{\text{upper}} \le \sup \sqrt{k} \lVert AS \rVert_F \le \sqrt{k} \lambda_{\max}(A) \sup \lVert S \rVert_F. The infimum and supremum are over the set {SSe,q=1}\{S \mid \lVert S \rVert_{e,q}=1\}, and are found using Lemma 4.3 relating e,q\lVert \cdot \rVert_{e,q} and F=e,2\lVert \cdot \rVert_F = \lVert \cdot \rVert_{e,2}.

Corollary 4.7. For Muon with element-wise pp-norm loss:

  • If p2p \ge 2 (1q21 \le q \le 2):

    AX0Be,pηkλmax(A)TfinalAX0Be,pηλmin(A)(mn)121q \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{k} \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}}

  • If p2p \le 2 (q2q \ge 2):

    AX0Be,pηkλmax(A)(mn)121qTfinalAX0Be,pηλmin(A) \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{k} \lambda_{\max}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)}

Adam/SignSGD

As usual, let's try to see whether this algorithm fits in our formulation. The update is dXtdt=ηsign(Gt)\frac{dX_t}{dt} = -\eta \cdot \text{sign}(G_t). The dual norm would be Gtγ=Gt,sign(Gt)F=Tr(GtTsign(Gt))=Gte,1\lVert G_t \rVert _\gamma = \langle G_t, \text{sign}(G_t) \rangle _F = \text{Tr}(G_t^T \text{sign}(G_t)) = \lVert G_t \rVert_{e, 1}, the element-wise L1L_1 norm. The geometry is thus induced by the element-wise LL_\infty norm, e,\lVert \cdot \rVert_{e, \infty}.

Remark 4.8. SignSGD can be viewed as a limiting case of the Adam optimizer where both moment estimates rapidly converge to the current gradient and its square.

Case 1: Schatten-p Norm Loss

Here α=Sp\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p} and δ=Sq\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}. This is a mixed-norm case.

We need to bound Clower/upper=inf/supSSq=1ASe,1C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_{e,1}. For the lower bound, we use the Frobenius bridge: ClowerinfASFλmin(A)infSSq=1SFC_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F. For the upper bound, we can use a tighter bridge than the Frobenius norm. We use the chain of inequalities Ze,1kZ*\lVert Z \rVert_{e,1} \le \sqrt{k} \lVert Z \rVert_* and Z*kZF\lVert Z \rVert_* \le \sqrt{k} \lVert Z \rVert_F. Cupper=supASe,1supkAS*kλmax(A)supSSq=1S*C_{\text{upper}} = \sup \lVert AS \rVert_{e,1} \le \sup \sqrt{k}\lVert AS \rVert_* \le \sqrt{k} \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_*. We need the inf/sup of SF\lVert S \rVert_F and S*\lVert S \rVert_* over the SqS_q-sphere. From Lemma 4.2: supSSq=1S*=supSS1=k11/q=k1/p\sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_* = \sup \lVert S \rVert_{S_1} = k^{1-1/q} = k^{1/p}. This gives Cupperkλmax(A)k1/pC_{\text{upper}} \le \sqrt{k} \lambda_{\max}(A) k^{1/p} for all pp.

The bounds for infSF\inf \lVert S \rVert_F depend on qq's relation to 2, as derived for NGD.

If p2p \ge 2 (1q21 \le q \le 2): infSF=k121q\inf \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}}.

If p2p \le 2 (q2q \ge 2): infSF=1\inf \lVert S \rVert_F = 1.

Corollary 4.9. For Adam/SignSGD with Schatten-pp loss:

  • If p2p \ge 2 (1q21 \le q \le 2):

    AX0BSpηkλmax(A)k1/pTfinalAX0BSpηλmin(A)k121q \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \sqrt{k} \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}}}

  • If p2p \le 2 (q2q \ge 2):

    AX0BSpηkλmax(A)k1/pTfinalAX0BSpηλmin(A) \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \sqrt{k} \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)}

Case 2: Element-wise p-Norm Loss

Here α=e,p\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{e,p}, so δ=e,q\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{e,q}.

We need to bound Clower/upper=inf/supSe,q=1ASe,1C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{e,q}=1} \lVert AS \rVert_{e,1}. For the lower bound: ClowerinfASFλmin(A)infSe,q=1SFC_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F. For the upper bound, we use the tighter bridge Ze,1kZF\lVert Z \rVert_{e,1} \le k \lVert Z \rVert_F, which comes from Ze,1kZ*kkZF\lVert Z \rVert_{e,1} \le \sqrt{k} \lVert Z \rVert_* \le \sqrt{k}\sqrt{k} \lVert Z \rVert_F. This is tighter than the standard mn\sqrt{mn} factor. CuppersupkASFkλmax(A)supSe,q=1SFC_{\text{upper}} \le \sup k \lVert AS \rVert_F \le k \lambda_{\max}(A) \sup_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F. The required inf/sup of SF=Se,2\lVert S \rVert_F = \lVert S \rVert_{e,2} over the e,qe,q-sphere come from Lemma 4.3.

If p2p \ge 2 (1q21 \le q \le 2): infSF=(mn)121q\inf \lVert S \rVert_F = (mn)^{\frac{1}{2}-\frac{1}{q}} and supSF=1\sup \lVert S \rVert_F = 1.

If p2p \le 2 (q2q \ge 2): infSF=1\inf \lVert S \rVert_F = 1 and supSF=(mn)121q\sup \lVert S \rVert_F = (mn)^{\frac{1}{2}-\frac{1}{q}}.

Corollary 4.10. For Adam/SignSGD with element-wise pp-norm loss:

  • If p2p \ge 2 (1q21 \le q \le 2):

    AX0Be,pηkλmax(A)TfinalAX0Be,pηλmin(A)(mn)121q \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta k\lambda_{\max}(A)} \le T_{final} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}}

  • If p2p \le 2 (q2q \ge 2):

    AX0Be,pηkλmax(A)(mn)121qTfinalAX0Be,pηλmin(A) \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta k\lambda_{\max}(A)(mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{final} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)}

A Note on Bound Tightness and Rate Variation

This note analyzes two key properties of the derived bounds: the potential variation in the convergence rate and the conditions under which the bounds are tight.

Analysis of Rate Variation

The ratio Cupper/ClowerC_{\text{upper}} / C_{\text{lower}} measures the stability of the convergence rate. A value near one implies a nearly constant rate, while a large value indicates substantial possible variation (both in the current path as well as distinct starting guesses). Let's analyze this for the direct Schatten-Schatten analysis, for example, Muon (γ=S1\gamma=S_1) with Schatten loss (δ=Sq\delta=S_q).

CupperClower=κ(A)k1/p \frac{C_{\text{upper}}}{C_{\text{lower}}} = \kappa(A) k^{1/p}

In this case, the rate variation depends on the conditioning of AA and a dimensional factor related to the loss geometry. For the more complex bridged analyses, the ratio will also include factors from the norm equivalences used. The general structure holds: rate variation is a product of contributions from the problem's conditioning (κ(A)\kappa(A)), the optimizer's geometry, and the loss function's geometry. In general, the closer the problem geometry is to the optimizer, the better the convergence guarantees there will be --- after all, we are doing some kind of steepest descent under some matrix norm.

Let's compare this to the corresponding bound for the Adam case:

CupperClower=κ(A)k1/p+1/min(2,q) \frac{C_{\text{upper}}}{C_{\text{lower}}} = \kappa(A) k^{1/p + 1/\min(2, q)}

For any Schatten-p norm, we see that Muon gives better convergence guarantees (according to our analysis) as compared to Adam, though it should be kept in mind that the bounds for Adam are not as tight.

Analysis of Bound Tightness

We now investigate whether our bounds on ClowerC_{\text{lower}} and CupperC_{\text{upper}} are achievable.

Case 1: Schatten-Schatten Analysis (Direct)

This case applies when both the optimizer's dual norm (γ\gamma) and the loss's dual norm (δ\delta) are Schatten norms (e.g., Muon with Schatten-p loss). The analysis uses the direct inequality λmin(A)SγASγλmax(A)Sγ\lambda_{\min}(A)\lVert S \rVert_{\gamma} \le \lVert AS \rVert_{\gamma} \le \lambda_{\max}(A)\lVert S \rVert_{\gamma} followed by a norm equivalence bound.

Lower Bound (ClowerC_{\text{lower}}): Our bound is Clowerλmin(A)infSδ=1SγC_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\lVert _{\delta}=1} \lVert S \rVert_\gamma. To achieve this bound, a single matrix SS must satisfy two conditions simultaneously:

  • ASγ=λmin(A)Sγ\lVert AS \rVert_\gamma = \lambda_{\min}(A) \lVert S \rVert_\gamma: This is achieved if SS lies in an eigenspace of AA associated with the minimum eigenvalue, λmin(A)\lambda_{\min}(A). For example, if S=uvTS=uv^T where u,vu,v are eigenvectors for λmin(A)\lambda_{\min}(A).
  • SS must be a matrix that achieves the infimum of Sγ\lVert S \rVert_\gamma over the unit sphere of δ\lVert \cdot \rVert_\delta. For Muon (γ=S1\gamma=S_1) and any Schatten loss (δ=Sq\delta=S_q), this infimum is 1 and is achieved by any rank-one matrix.

A rank-one matrix S=uvTS=uv^T constructed from eigenvectors of λmin(A)\lambda_{\min}(A) satisfies both conditions. Therefore, the lower bound is tight.

Upper Bound (CupperC_{\text{upper}}): Our bound is Cupperλmax(A)supSδ=1SγC_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\lVert _{\delta}=1} \lVert S \rVert_\gamma. This requires a matrix SS to satisfy:

  • ASγ=λmax(A)Sγ\lVert AS \rVert_\gamma = \lambda_{\max}(A) \lVert S \rVert_\gamma: This requires SS to lie in the eigenspace of λmax(A)\lambda_{\max}(A).
  • SS must achieve the supremum of Sγ\lVert S \rVert_\gamma over the unit sphere of δ\lVert \cdot \rVert_\delta. For Muon (γ=S1\gamma=S_1, δ=Sq\delta=S_q), this supremum is k1/pk^{1/p} and is achieved by a matrix whose non-zero singular values are all equal (a "spectrally flat" (or whitened) matrix).

This is achievable if the eigenspace of λmax(A)\lambda_{\max}(A) is large enough (i.e., its multiplicity is at least k=min(n,m)k=\min(n,m)) to construct a spectrally flat, rank-kk matrix within it. If λmax(A)\lambda_{\max}(A) is a simple (non-repeated) eigenvalue, this is generally not possible for k>1k>1. Thus, the upper bound is tight under certain conditions on AA, but may be conservative otherwise.

Case 2: Element-wise/Element-wise Analysis (Bridged)

This case applies to Adam/SignSGD with an element-wise p-norm loss. The analysis uses the Frobenius norm as a bridge.

Lower Bound (Clowerλmin(A)infSe,q=1SFC_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\lVert _{e,q}=1} \lVert S \rVert_F):

  • ASF=λmin(A)SF\lVert AS \rVert_F = \lambda_{\min}(A) \lVert S \rVert_F: Requires SS to lie in the eigenspace of λmin(A)\lambda_{\min}(A).
  • SS must achieve infSe,q=1SF\inf_{\lVert S\lVert _{e,q}=1} \lVert S \rVert_F. This is achieved by a matrix with only one non-zero entry (an extremely sparse matrix, e.g., S=EijS=E_{ij}).

The bound is tight only if a basis matrix EijE_{ij} is an eigenvector of AA. This only occurs if AA is a diagonal matrix. For general non-diagonal AA, the conditions are in conflict, making the lower bound conservative.

Upper Bound (Cupperλmax(A)supSe,q=1SFC_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\lVert _{e,q}=1} \lVert S \rVert_F):

  • ASF=λmax(A)SF\lVert AS \rVert_F = \lambda_{\max}(A) \lVert S \rVert_F: Requires SS to lie in the eigenspace of λmax(A)\lambda_{\max}(A).
  • SS must achieve supSe,q=1SF\sup_{\lVert S\lVert _{e,q}=1} \lVert S \rVert_F. This is achieved by a matrix whose non-zero entries all have the same absolute value (a "dense" matrix, like the all-ones matrix JJ).

This is tight only if a matrix of constant-magnitude entries (like JJ) is an eigenvector of AA. This is true for highly structured matrices (e.g., A=c1I+c2JA=c_1I + c_2J) but not for general AA. The upper bound is therefore also conservative.

  • Case 3: Mixed-Norm Analysis (Bridged)

    This case applies to Adam/SignSGD with Schatten loss or Muon with element-wise loss. The analysis involves inequalities bridging between Schatten and element-wise norms (e.g., Ze,1kZ*\lVert Z \rVert_{e,1} \le \sqrt{k} \lVert Z \rVert_* or ZFZ*\lVert Z \rVert_F \le \lVert Z \rVert_*).

    Lower Bound (ClowerC_{\text{lower}}): Consider Adam/SignSGD with Schatten loss, where Clowerλmin(A)infSSq=1SFC_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\lVert _{S_q}=1} \lVert S \rVert_F.

    • ASF=λmin(A)SF\lVert AS \rVert_F = \lambda_{\min}(A) \lVert S \rVert_F: Requires SS to be in the λmin(A)\lambda_{\min}(A) eigenspace.
    • SS must achieve infSSq=1SF\inf_{\lVert S\lVert _{S_q}=1} \lVert S \rVert_F. This requires SS to be rank-one (if q2q \ge 2) or spectrally flat (if q<2q < 2).

    For q2q \ge 2, we require a rank-one matrix from the λmin(A)\lambda_{\min}(A) eigenspace, which is always possible. So for this sub-case, the bound is tight. For q<2q < 2, it requires a spectrally flat matrix in that eigenspace, which is only possible if λmin(A)\lambda_{\min}(A) is degenerate.

    Upper Bound (CupperC_{\text{upper}}): Consider Adam/SignSGD again, with the bound Cupperkλmax(A)supSSq=1S*C_{\text{upper}} \le \sqrt{k} \lambda_{\max}(A) \sup_{\lVert S\lVert _{S_q}=1} \lVert S \rVert_*.

    • ASe,1kAS*\lVert AS \rVert_{e,1} \le \sqrt{k} \lVert AS \rVert_*: The first bridging step. Equality requires ASAS to be a rank-one matrix whose entries have equal absolute magnitude.
    • AS*λmax(A)S*\lVert AS \rVert_* \le \lambda_{\max}(A) \lVert S \rVert_*: The spectral bound. Equality requires SS to be in the λmax(A)\lambda_{\max}(A) eigenspace.

    This requires that a matrix SS from the λmax(A)\lambda_{\max}(A) eigenspace produces a matrix ASAS that is both rank-one and has entries of equal magnitude. These are generally conflicting geometric requirements. If SS is rank-one, ASAS will be too, but requiring its entries to have equal magnitude is a very strong condition on AA. This bound is therefore typically very conservative.

    In summary, the direct analysis for matching Schatten norms yields tight or near-tight bounds. In contrast, any analysis that must bridge between element-wise and Schatten "geometries" introduces conflicting requirements, leading to bounds that are more likely to be conservative estimates.

Using This In Practice: A Comparison of Optimizers

The theoretical framework developed provides another lens through which to compare familiar optimization algorithms. By understanding their underlying geometry, we can move beyond empirical performance and attempt to reason about why they behave differently.

Let us reiterate the warning from the beginning - intuition built from toy problems can only go so far.

The Problem with Vanilla Gradient Descent

Standard Gradient Descent (GD) does not belong to the family of "geometric optimizers" analyzed here. Its continuous-time flow is given by:

dXtdt=ηGt=ηASt \frac{dX_t}{dt} = -\eta G_t = -\eta A S_t

Notice the absence of the dualizer\text{dualizer} map. The update magnitude is directly proportional to the norm of the gradient, Gt\lVert G_t \rVert. As the optimizer approaches the solution X*X^*, the residual AXtBAX_t-B shrinks, causing the gradient norm to approach zero. This leads to ever-slowing convergence.

For the classic case of a Frobenius norm loss (α=F\alpha=F), the gradient is Gt=A(AXtB)G_t = A(AX_t - B). The flow becomes a linear ODE, whose exact solution is:

XtX*=eA2ηt(X0X*) X_t - X^* = e^{-A^2 \eta t} (X_0 - X^*)

This explicitly shows that the error term decays exponentially but only vanishes as tt \to \infty. Standard GD offers infinite-time asymptotic convergence.

Geometric Optimizers as the Solution

The geometric optimizers studied in this post --- NGD, Muon, and Adam/SignSGD --- all "fix" this problem of vanishing updates. They achieve this by normalizing the gradient, ensuring the update step has a constant size with respect to a norm. The update rule dXtdt=ηdualizerβ(Gt)\frac{dX_t}{dt} = -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) has the property that dXtdtβ=η\lVert \frac{dX_t}{dt} \rVert_\beta = \eta, a constant. This fixed-size progress in the chosen norm forces the optimizer to reach the zero-loss solution in a finite amount of time. However, if in an exceptionally smooth loss landscape (like the one we use), there will be oscillations in the discrete step size case.

Comparing the structure for Normalized Gradient Descent, Muon, and Adam/SignSGD

While all three ensure finite-time convergence, their performance characteristics, revealed by the convergence bounds, are quite different because they are associated with different norms.

  • Normalized Gradient Descent (NGD): This optimizer uses a Euclidean norm (β=F\beta=F). It normalizes the entire gradient matrix GtG_t, scaling it down to have a Frobenius norm of 1. It treats all entries and all singular values of the gradient democratically. It is a good baseline but may be blind to more nuanced structures in the problem.

  • Muon: This optimizer uses the spectral norm (β=S\beta=S_\infty). If the gradient has an SVD of Gt=UΣVTG_t = U\Sigma V^T, the update direction is the matrix UVT-UV^T. This update is an isometry that encapsulates the "pure orientation" of the gradient, stripped of all singular value scaling. It's a whitened update that respects the singular vector spaces of the gradient. This makes it highly sensitive to the spectral properties of the error, treating every singular direction present in the gradient with equal importance.

  • SignSGD and Adam: The SignSGD optimizer uses an element-wise norm (β=e,\beta=e,\infty, so γ=e,1\gamma=e,1). Its dualizer is the element-wise sign function, so every entry of the update matrix is either +1+1 or 1-1 (scaled by η\eta). This is closely related to the popular Adam optimizer. In Adam, the gradient is divided element-wise by the square root of its squared moving average (vtv_t). If the momentum terms are ignored (or in a quasi-stationary limit where the estimates have converged), the Adam update is approximately m̂t/(v̂t+ϵ)sign(Gt)\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \approx \text{sign}(G_t). Our analysis of SignSGD can therefore be seen as a theoretical characterization of an idealized Adam with ϵ=0\epsilon=0. This geometry considers each matrix entry independently, making it effective when errors are sparse or scattered without a clear spectral pattern.

A final note on momentum. While we have not studied it here, momentum can be added to any of these gradient flows by adding a coupled differential equation. It often accelerates convergence but can also introduce oscillations, as it gives the optimizer "mass" and inertia. It is possible to do such an analysis but we will leave it for another time, since it is mathematically much more involved. Gabriel Goh's article on momentum at distill.pub explains the key intuitions around momentum quite well, explaining how it can lead to as big as a square-root reduction in the "effective" condition number which has important consequences for convergence.

The Nature of the Updates

The choice of the geometry dictates the very structure of the update step, ΔX=ηdualizer(Gt)\Delta X = -\eta \cdot \text{dualizer}(G_t).

  • Muon (with β=S\beta=S_\infty, γ=S1\gamma=S_1) produces a full-rank, spectrally flat update matrix, UVTUV^T, where Gt=UΣVTG_t=U\Sigma V^T. This update preserves the singular vector spaces of the gradient but replaces all non-zero singular values with 1. It is a "pure rotation" (or isometry) in the direction of the error. It changes the orientation of the solution without preferentially scaling any spectral direction.

  • Adam/SignSGD (with β=e,\beta=e,\infty, γ=e,1\gamma=e,1) produces a dense, full-rank update matrix whose entries are all ±η\pm \eta. It modifies every single parameter of XX at every step, with the magnitude of the change being independent of the magnitude of the gradient for that entry.

  • NGD (with β=F\beta=F, γ=F\gamma=F) produces a full-rank update that is a scaled version of the gradient, preserving its relative structure completely.

However, in practice, this is a different situation (this is also why we should take all intuition from toy problems with a grain of salt). The gradients GtG_t produced by standard methods like SGD or Adam are often empirically ill-conditioned or "spectrally spiky". This means their singular value spectrum is dominated by a few very large values, with a long tail of small ones, making the gradient matrix effectively low-rank. Adam/SignSGD's approach is to bet on element-wise whitening, which doesn't work so well. Muon explicitly performs the spectral whitening (i.e., orthogonalization) which discards information from the singular values and updates based on all singular vectors equally --- the idea is to increase the contribution of rare but nevertheless important directions.

Also note that since the weights themselves are EMAs of gradients with weight decay, this might also lead to more effective use of the capacity of the network if the weight decay is too fast.

Building Intuition from the Convergence Bounds

The bounds derived in Theorem 3.1 and instantiated in Section 4 can provide some intuition about how to choose and reason about optimizers.

The Meaning of ClowerC_{\text{lower}} and CupperC_{\text{upper}}

The main result states that TfinalL(X0)/(ηC)T_{\text{final}} \approx L(X_0) / (\eta C). CC can be thought of as a speed. Specifically, C=ASγC = \lVert AS \rVert_\gamma measures how "large" the gradient step is in the optimizer's geometry (γ\gamma), given a unit-sized error "direction" SS in the loss's geometry (δ\delta).

The constants ClowerC_{\text{lower}} and CupperC_{\text{upper}} represent the worst-case and best-case scenarios for this rate.

  • ClowerC_{\text{lower}} is the slowest possible rate of convergence. A large ClowerC_{\text{lower}} gives a strong worst-case guarantee.
  • CupperC_{\text{upper}} is the fastest possible rate. It tells you the best you can hope for.

Rate Stability and Comparing Optimizers

The ratio Cupper/ClowerC_{\text{upper}} / C_{\text{lower}} is a measure of the stability or robustness of the convergence rate.

  • A ratio close to 1 means the optimizer makes steady, predictable progress regardless of the specific error structure at any given time/state, or the starting-point.
  • A large ratio implies the convergence can be highly variable: sometimes very fast, sometimes very slow. This happens when the optimizer's geometry is poorly matched to certain error structures.

When comparing two optimizers, vaguely speaking, one can normalize the learning rate η\eta such that their "update sizes" match (e.g., make ηCupper\eta C_{\text{upper}} or some measure of an average ηC\eta C or some norm of the update equal for both). The comparison then boils down to which one has a larger ClowerC_{\text{lower}}. The optimizer with the better worst-case guarantee (larger ClowerC_{\text{lower}}) is often preferable, as it is more reliable. For instance, our analysis for Schatten loss shows that the rate variation for Muon is κ(A)k1/p\kappa(A)k^{1/p}, while the corresponding (though potentially looser) bound for Adam/SignSGD includes additional factors, suggesting Muon's convergence rate might be more stable (or just have better mathematical properties). In the case of p=2p = 2, though, the Adam convergence time bounds are tight (the Frobenius norm is both a Schatten-p norm as well as an element-wise norm, so we can pick AA to be the matrices in the element-element analysis for SignSGD), so Muon is a better choice for this case. We can make this precise too, because after normalizing the learning rate with functions of the matrix size such that the Frobenius norm of the updates is reasonably constant on average (taken over all possible starting points XX), the lower bounds on time become constant for both optimizers, but the upper bounds vary at different rates, so the worst case scenario of Muon is better than that for Adam.

Also, ideally, if you know the geometry of the problem, the choice of the optimizer should be directly derived from that. If not, it may be chosen in a way that does not have a lot of deviation at any given point from the actual geometry of the problem, in some sense. If the choice is made this way, then the spectral norm (and hence Muon) in general seems to be a good choice.

Further reading

On the topic of metrized deep learning, a good introduction is the survey here - I would highly recommend doing a breadth-first search through all the references mentioned there to understand the main ideas better.

After I wrote this post, someone pointed out this paper which also talks about Muon's convergence rate (discrete steps, non-convex problems, but not necessarily finite-time) to local minima and compares it to gradient descent.

Cite this post
@online{optimizers-toy-problem-2025,
  author    = {nor},
  title     = {Theoretical properties of optimizers on a toy problem, and some intuition},
  year      = {2025},
  month     = {08},
  day       = {02},
  url       = {https://nor-blog.pages.dev/posts/2025-08-02-optimizers-toy-problem},
}