Introduction
We want to answer the question: how do optimizers fundamentally differ in their approach to finding a minimum? To explore this question in a controlled environment, we focus on a simple problem: minimizing the matrix error term \(\min \lVert AX - B \rVert\).
This problem serves as a toy task for theoretical exploration because it is tractable, in large part due to its simplicity. By assuming the matrix \(A\) is symmetric and positive definite (SPD), we guarantee a unique, known solution \(X^* = A^{-1}B\), which lets us measure progress toward the optimum exactly. The problem is convex, which removes the complexities of local minima and lets us focus purely on the dynamics of convergence. We make this compromise because non-convexity makes theoretical analysis significantly harder; this in no way means that non-convex problems are unimportant, and indeed most interesting optimization problems are non-convex. Furthermore, by allowing the error \(\lVert AX - B \rVert\) to be measured by a wide range of norms (both Schatten and element-wise), we can study how the problem’s “shape” influences optimizer performance.
Our analysis centers on a family of what we will informally call “geometric optimizers”. These are algorithms whose update direction is determined by normalizing the gradient according to a chosen matrix norm. This family includes well-known methods like Normalized Gradient Descent (which uses a standard Euclidean “geometry”), the Muon optimizer (which uses a spectral “geometry”), and SignSGD (which uses an element-wise “geometry” and can be viewed as Adam without momentum).
The good news is that, for this toy problem, any optimizer in this family is guaranteed to find the exact solution in a finite amount of time. This stands in sharp contrast to standard gradient descent, which only approaches the solution asymptotically as time goes to infinity. We show that the total convergence time is bounded and is determined by the relationship between the optimizer’s chosen update norm and the problem’s loss norm. In particular, we will see how Muon’s theoretical convergence guarantees are nicer than Adam’s.
This post is structured to build this from the ground up. We will:
- Formally define the optimization problem and the class of geometric optimizers.
- Present a general theorem that provides upper and lower bounds on the convergence time.
- Apply this theorem to Normalized Gradient Descent, Muon, and Adam/SignSGD to derive concrete, comparable performance bounds.
- Use these results (and some additional observations) to discuss the different motivations behind these optimizers and build some intuition about their behavior.
It is important to state at the outset that our analysis is set in an idealized continuous-time framework with many built-in simplifications (no discrete step-size pathologies, no noise, a very simple loss landscape). The primary goal is to provide a mathematical lens for understanding why and how different optimizers behave differently. The intuition developed here might serve as a valuable foundation for thinking about their discrete-time and stochastic counterparts used in practice. Be warned, though: intuition built from toy problems can only go so far.
The Setting
The Optimization Problem
We investigate the following convex optimization problem:
\[ \min_{X \in \mathbb{R}^{n \times m}} L(X) \coloneqq \lVert AX - B \rVert_{\alpha} \]
Here, the matrix \(A \in \mathbb{R}^{n \times n}\) is assumed to be symmetric and positive definite (SPD). This is an important assumption, implying that \(A\) is invertible and all its eigenvalues are strictly positive. The matrices \(X \in \mathbb{R}^{n \times m}\) and \(B \in \mathbb{R}^{n \times m}\) represent the variable and the target, respectively. The objective function \(L(X)\) measures the “size” of the residual matrix \(AX - B\) using a chosen matrix norm, \(\lVert \cdot \rVert_{\alpha}\). Our analysis covers two primary families of norms: Schatten-p norms and element-wise p-norms.
Since \(A\) is invertible, a unique solution \(X^*\) exists that makes the residual zero: \(AX^* - B = 0\). This solution is \(X^* = A^{-1}B\). At this point, the loss function attains its global minimum value of \(L(X^*) = \lVert A(A^{-1}B) - B \rVert_{\alpha} = \lVert 0 \rVert_{\alpha} = 0\). The goal of an optimizer is to find this unique solution \(X^*\).
Definition 2.1. (Schatten-p norm) The Schatten-p norm, denoted \(\lVert \cdot \rVert_{S_p}\), is a norm on the space of matrices, defined via their singular values. For any matrix \(Z \in \mathbb{R}^{n \times m}\), let its singular values be \(\sigma_1(Z) \ge \sigma_2(Z) \ge \dots \ge \sigma_r(Z) > 0\), where \(r = \text{rank}(Z)\). The Schatten-p norm for \(p \in [1, \infty]\) is defined as:
\[ \lVert Z \rVert_{S_p} \coloneqq \left( \sum_{i=1}^r \sigma_i(Z)^p \right)^{1/p} \]
This is equivalent to the \(L_p\) norm of the vector of singular values. Notable cases include the Nuclear norm (\(p=1\), denoted \(\lVert \cdot \rVert_*\)), the Frobenius norm (\(p=2\), denoted \(\lVert \cdot \rVert_F\)), and the Spectral norm (\(p=\infty\)). The dual (we will explain this in a later section) of the Schatten-p norm is the Schatten-q norm, where \(p\) and \(q\) are Hölder conjugates, satisfying \(1/p + 1/q = 1\).
Definition 2.2. (Element-wise p-norm) The element-wise p-norm, denoted \(\lVert \cdot \rVert_{e,p}\), is a norm on the space of matrices defined by treating the matrix as a vector in \(\mathbb{R}^{mn}\). For any matrix \(Z \in \mathbb{R}^{n \times m}\), the element-wise p-norm for \(p \in [1, \infty]\) is defined as:
\[ \lVert Z \rVert_{e,p} \coloneqq \left( \sum_{i=1}^n \sum_{j=1}^m |Z_{ij}|^p \right)^{1/p} \]
This is the standard \(L_p\) norm of the vectorized matrix. The case \(p=2\) corresponds to the Frobenius norm, \(\lVert Z \rVert_{e,2} = \lVert Z \rVert_F\). The dual of the element-wise p-norm is the element-wise q-norm, where \(1/p + 1/q = 1\).
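To make the two norm families concrete, here is a minimal numpy sketch (the function names are our own, not from any library); both reduce to the Frobenius norm at \(p=2\):

```python
import numpy as np

def schatten_norm(Z, p):
    """Schatten-p norm: the L_p norm of the singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    return s.max() if np.isinf(p) else (s ** p).sum() ** (1.0 / p)

def elementwise_norm(Z, p):
    """Element-wise p-norm: the L_p norm of Z flattened into a vector."""
    v = np.abs(Z).ravel()
    return v.max() if np.isinf(p) else (v ** p).sum() ** (1.0 / p)

Z = np.random.randn(4, 3)
# p = 2 gives the Frobenius norm in both families.
assert np.isclose(schatten_norm(Z, 2), np.linalg.norm(Z, "fro"))
assert np.isclose(elementwise_norm(Z, 2), np.linalg.norm(Z, "fro"))
```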
Subgradient of the Loss Function
To minimize \(L(X)\), we need its gradient. However, for \(p \neq 2\), neither the Schatten nor element-wise p-norms are differentiable everywhere. We therefore use the more general concept of the subgradient.
The subgradient of a convex function \(f\) at a point \(x\) is a vector \(g\) such that \(f(y) \ge f(x) + g^T(y-x)\) for all \(y\). The set of all such vectors is called the subdifferential, denoted \(\partial f(x)\). For our loss function \(L(X) = h(g(X))\) where \(h(Z) = \lVert Z \rVert_{\alpha}\) and \(g(X) = AX - B\), the chain rule for subdifferentials states that \(\partial_X L(X) = A^T \partial_Z h(AX-B)\).
Let \(\lVert \cdot \rVert_\delta\) be the dual norm to the loss norm \(\lVert \cdot \rVert_\alpha\). (If \(\alpha\) is \(S_p\), \(\delta\) is \(S_q\); if \(\alpha\) is \(e,p\), \(\delta\) is \(e,q\).) Almost by definition of the dual norm (which we give later), the subdifferential of the norm \(\lVert \cdot \rVert_\alpha\) at a matrix \(Z \neq 0\) is given by the following, where \(\langle S, Z \rangle_F = \text{Tr}(S^T Z)\) denotes the standard Frobenius inner product:
\[ \partial_Z \lVert Z \rVert_{\alpha} = \{ S \in \mathbb{R}^{n \times m} \mid \lVert S \rVert_{\delta} = 1 \text{ and } \langle S, Z \rangle_F = \lVert Z \rVert_{\alpha} \} \]
Consequently, the subdifferential of our loss function \(L(X)\) (for \(X \neq X^*\)) is: \[ \partial_X L(X) = \{ A^T S \mid S \in \mathbb{R}^{n \times m}, \lVert S \rVert_{\delta} = 1, \langle S, AX-B \rangle_F = \lVert AX-B \rVert_{\alpha} \} \] Since our matrix \(A\) is symmetric, \(A^T = A\). For the purpose of analyzing a continuous-time gradient flow, we can consider an optimizer that, at each time \(t\), selects one specific subgradient \(G_t \in \partial_X L(X_t)\) to determine its direction.
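As a concrete illustration, here is a small sketch (our own, beyond what is defined above) of one particular choice of subgradient for two loss norms; in both cases the matrix \(S\) below satisfies \(\lVert S \rVert_\delta = 1\) and \(\langle S, Z \rangle_F = \lVert Z \rVert_\alpha\):

```python
import numpy as np

def one_subgradient(A, X, B, loss="fro"):
    """Return one element G = A @ S of the subdifferential of ||AX - B||."""
    Z = A @ X - B                     # residual, assumed non-zero
    if loss == "fro":                 # alpha = Frobenius, delta = Frobenius
        S = Z / np.linalg.norm(Z, "fro")
    elif loss == "l1":                # alpha = element-wise L1, delta = element-wise L_inf
        S = np.sign(Z)
    else:
        raise ValueError(f"unknown loss norm: {loss}")
    return A.T @ S                    # A is symmetric in our setting, so A.T == A
```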
The Family of Geometric Optimizers
We define a family of optimizers based on the principle of continuous steepest descent. The direction of steepest descent depends on the geometry of the underlying space, which we define by choosing a norm, \(\lVert \cdot \rVert_\beta\), on the space of matrices \(\mathbb{R}^{n \times m}\).
Definition 2.3. (Dual norm) For any norm \(\lVert \cdot \rVert_\beta\) on a vector space, its dual norm, denoted \(\lVert \cdot \rVert_\gamma\), is defined for a matrix \(G\) as:
\[ \lVert G \rVert_\gamma \coloneqq \sup_{Z : \lVert Z \rVert_\beta \le 1} \langle G, Z \rangle_F \]
where \(\langle G, Z \rangle_F = \text{Tr}(G^T Z)\) is the standard Frobenius inner product. The dual norm measures the maximum projection of \(G\) onto the unit ball of the primal norm \(\lVert \cdot \rVert_\beta\).
It is left to the curious reader to verify that such a mapping is indeed a norm, and all the properties of dual norms mentioned in the previous section.
Definition 2.4. (Dualizer) The dualizer for a norm \(\lVert \cdot \rVert_\beta\) is a map, \(\text{dualizer}_{\lVert \cdot \rVert_\beta}(\cdot)\), which for any input matrix \(G \neq 0\), returns a direction \(Z\) that achieves the supremum in the dual norm definition (when the maximizer is not unique, any choice will do for our purposes). That is, \(\text{dualizer}_{\lVert \cdot \rVert_\beta}(G)\) is a matrix \(Z\) such that:
\[ \lVert Z \rVert_\beta=1 \quad \text{and} \quad \langle G, Z \rangle_F = \lVert G \rVert_\gamma \]
The dualizer identifies the direction in the unit ball of \(\lVert \cdot \rVert_\beta\) that is most aligned with \(G\).
Definition 2.5. (Continuous-time gradient flow for an optimizer) The continuous-time gradient flow for an optimizer is defined by moving in the direction of steepest descent, which is precisely the negative of the dualized gradient. For a chosen subgradient \(G_t \in \partial_X L(X_t)\), the dynamics are:
\[ \frac{dX_t}{dt} = -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \]
where \(\eta > 0\) is a constant learning rate that scales the step size.
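As a hedged sketch (the function names are ours), here are the dualizers for the three geometries analyzed below, together with one explicit-Euler step of the flow; the Euler discretization is our own addition, since the text itself works in continuous time:

```python
import numpy as np

def dualize_frobenius(G):
    """Unit-Frobenius-norm direction most aligned with G (used by NGD)."""
    return G / np.linalg.norm(G, "fro")

def dualize_spectral(G):
    """Unit-spectral-norm direction most aligned with G (used by Muon): U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def dualize_elementwise_inf(G):
    """Unit-(e,inf)-norm direction most aligned with G (SignSGD / idealized Adam)."""
    return np.sign(G)

def euler_step(X, G, dualizer, eta, dt):
    """One explicit-Euler step of dX/dt = -eta * dualizer(G)."""
    return X - eta * dt * dualizer(G)
```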
A Convergence Theorem
Theorem 3.1. For the optimization problem with a loss norm \(\lVert \cdot \rVert_\alpha\) that is either a Schatten-p norm (\(p \in [1, \infty]\)) or an element-wise p-norm (\(p \in [1, \infty]\)), any optimizer defined by the continuous-time gradient flow under a chosen “geometry”-inducing norm \(\lVert \cdot \rVert_\beta\) converges to the unique global minimum \(X^*\) in a finite amount of time, \(T_{\text{final}}\). This time is bounded as follows:
\[ \frac{L(X_0)}{\eta \cdot C_{\text{upper}}} \le T_{\text{final}} \le \frac{L(X_0)}{\eta \cdot C_{\text{lower}}} \]
where \(L(X_0) = \lVert AX_0 - B \rVert_{\alpha}\) is the initial loss, \(\lVert \cdot \rVert_\gamma\) is the dual norm to the optimizer’s norm \(\lVert \cdot \rVert_\beta\), and \(\lVert \cdot \rVert_\delta\) is the dual norm to the loss norm \(\lVert \cdot \rVert_\alpha\). The constants \(C_{\text{lower}}\) and \(C_{\text{upper}}\) are defined by the interaction of the matrix \(A\) with the chosen geometries:
\[ C_{\text{lower}} \coloneqq \inf_{S \in \mathbb{R}^{n \times m} : \lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \quad \text{and} \quad C_{\text{upper}} \coloneqq \sup_{S \in \mathbb{R}^{n \times m} : \lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \]
(Since \(A\) is SPD, we use \(A\) in place of its transpose \(A^T\).)
Proof. The proof proceeds in three steps: defining a potential function and its time derivative, bounding this derivative, and integrating the resulting differential inequality to find the time to convergence.
The Potential Function and its Time Derivative. We choose the objective function itself as a Lyapunov potential function: \(V(t) \coloneqq L(X_t) = \lVert AX_t - B \rVert_{\alpha}\). Its time derivative is the directional derivative in the direction of the flow, given by the inner product of a subgradient \(G_t\) with the velocity vector \(\frac{dX_t}{dt}\).
\[ \frac{dV}{dt} = \left\langle G_t, \frac{dX_t}{dt} \right\rangle_F = \left\langle G_t, -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F = -\eta \left\langle G_t, \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F \]
By the definition of the dualizer, \(\left\langle G_t, \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t) \right\rangle_F = \lVert G_t \rVert_\gamma\). Thus, we have a simple expression for the rate of change:
\[ \frac{dV}{dt} = -\eta \lVert G_t \rVert_\gamma \]
This shows that the potential function is non-increasing as long as the gradient is non-zero.
Bounding the Rate of Change. From the equation on subgradients, any subgradient \(G_t\) can be expressed as \(G_t = A S_t\) for some matrix \(S_t \in \mathbb{R}^{n \times m}\) which satisfies \(\lVert S_t \rVert_{\delta} = 1\). Substituting this into our expression for the derivative gives:
\[ \frac{dV}{dt} = -\eta \lVert A S_t \rVert_\gamma, \quad \text{where the matrix } S_t \text{ satisfies } \lVert S_t \rVert_{\delta} = 1. \]
The rate of decay \(\frac{dV}{dt}\) depends on the matrix \(S_t\), which is always constrained to lie on the unit sphere of the norm \(\lVert \cdot \rVert_\delta\). Therefore, the rate of change must lie between fixed bounds determined by the minimum and maximum possible values of \(\lVert A S \rVert_\gamma\) over this sphere.
\[ -\eta \sup_{S:\lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \le \frac{dV}{dt} \le -\eta \inf_{S:\lVert S\rVert _{\delta}=1} \lVert AS \rVert_\gamma \]
Using the definitions of \(C_{\text{lower}}\) and \(C_{\text{upper}}\) from the theorem statement, we obtain the inequality:
\[ -\eta C_{\text{upper}} \le \frac{dV}{dt} \le -\eta C_{\text{lower}} \]
Since \(A\) is SPD, its null space is trivial, so for any non-zero \(S\), \(AS\) is non-zero. Since norms are positive for non-zero inputs, \(C_{\text{lower}}\) is strictly positive.
Integration and Finite-Time Convergence. The differential inequality \(\frac{dV}{dt} \le -\eta C_{\text{lower}}\) guarantees that \(V(t)\) decreases at a rate of at least \(\eta C_{\text{lower}}\). To find an upper bound on the time to convergence, we integrate this from \(t=0\) to \(T_{\text{final}}\):
\[ \int_{V(0)}^{V(T_{\text{final}})} dV \le \int_{0}^{T_{\text{final}}} -\eta C_{\text{lower}} \, dt \]
Convergence occurs when \(V(T_{\text{final}}) = 0\). The initial value is \(V(0) = L(X_0)\).
\[ 0 - L(X_0) \le -\eta C_{\text{lower}} \cdot T_{\text{final}} \implies T_{\text{final}} \le \frac{L(X_0)}{\eta C_{\text{lower}}} \]
This provides the upper bound on the convergence time. The lower bound is found by integrating the other side, \(\frac{dV}{dt} \ge -\eta C_{\text{upper}}\):
\[ -L(X_0) \ge -\eta C_{\text{upper}} \cdot T_{\text{final}} \implies T_{\text{final}} \ge \frac{L(X_0)}{\eta C_{\text{upper}}} \]
This establishes both bounds and completes the proof. \(\blacksquare\)
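The theorem is easy to probe numerically. Below is a small self-contained sketch (our own; the explicit-Euler loop only approximates the continuous flow, and the discrete iterates stall at a loss floor set by the step size) that runs the Muon-style flow on a Frobenius-norm loss and compares the time taken with the upper bound from the theorem, using \(C_{\text{lower}} \ge \lambda_{\min}(A)\) as derived in the Muon section below:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD by construction
B = rng.standard_normal((n, m))
X = np.zeros((n, m))                 # X_0 = 0, so L(X_0) = ||B||_F

eta, dt = 1.0, 1e-4
L0 = np.linalg.norm(B, "fro")
lam_min = np.linalg.eigvalsh(A).min()
print("theorem's upper bound on T_final:", L0 / (eta * lam_min))

# Spectral-geometry (Muon-style) flow on the Frobenius loss, explicit Euler.
for step in range(100_000):
    Z = A @ X - B
    loss = np.linalg.norm(Z, "fro")
    if loss < 1e-2:                  # "near zero"; the discrete loop cannot go below O(dt)
        print("loss < 1e-2 reached at t =", step * dt)
        break
    G = A @ (Z / loss)               # one subgradient of ||AX - B||_F
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    X -= eta * dt * (U @ Vt)         # dX_t/dt = -eta * dualizer(G_t)
```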
Convergence Bounds for Specific Scenarios
We now instantiate the general result of Theorem 3.1 for three specific optimizers. The idea is to derive bounds for \(C_{\text{lower}}\) and \(C_{\text{upper}}\).
Remark 4.1. Bounding \(C\)
Our approach depends on the types of norms involved.
- Schatten-Schatten Norms: When both the optimizer’s dual norm (\(\gamma\)) and the loss’s dual norm (\(\delta\)) are Schatten norms, we can use a direct and powerful inequality: for any Schatten norm \(\lVert\cdot\rVert_{S_r}\), \(\lambda_{\min}(A) \lVert S \rVert_{S_r} \le \lVert AS \rVert_{S_r} \le \lambda_{\max}(A) \lVert S \rVert_{S_r}\). This allows for a very tight analysis.
- Mixed or Element-wise Norms: When at least one of the norms (\(\gamma\) or \(\delta\)) is an element-wise norm, the direct inequality above does not hold. In these cases, we use the Frobenius norm (\(\lVert \cdot \rVert_F\)) as a bridge (for convenience; there are stronger results for certain cases), as it is both a Schatten and an element-wise norm. We relate the norms \(\gamma\) and \(\delta\) to the Frobenius norm and use the fact that \(\lambda_{\min}(A) \lVert S \rVert_F \le \lVert AS \rVert_F \le \lambda_{\max}(A) \lVert S \rVert_F\).
There are more powerful ways of bounding these quantities, but we refrain from using them for the sake of simplicity.
The following lemmas on norm equivalence, which are direct consequences of the generalized mean inequality, will be useful.
Lemma 4.2. (Schatten norm equivalence) Let \(S \in \mathbb{R}^{n \times m}\) be any matrix with \(k = \min(n, m)\). For any \(1 \le a \le b \le \infty\), the Schatten norms of \(S\) are related by:
\[ \lVert S \rVert_{S_b} \le \lVert S \rVert_{S_a} \le k^{\frac{1}{a}-\frac{1}{b}} \lVert S \rVert_{S_b} \]
Lemma 4.3. (Element-wise norm equivalence) Let \(S \in \mathbb{R}^{n \times m}\). For any \(1 \le a \le b \le \infty\), the element-wise norms of \(S\) are related by:
\[ \lVert S \rVert_{e,b} \le \lVert S \rVert_{e,a} \le (mn)^{\frac{1}{a}-\frac{1}{b}} \lVert S \rVert_{e,b} \]
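Both lemmas are easy to check numerically; the following sketch (our own) verifies them on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 7
k = min(n, m)
S = rng.standard_normal((n, m))
sv = np.linalg.svd(S, compute_uv=False)   # singular values (Lemma 4.2)
ev = np.abs(S).ravel()                    # entries (Lemma 4.3)

def lp(v, p):
    return v.max() if np.isinf(p) else (v ** p).sum() ** (1.0 / p)

for a, b in [(1, 2), (2, np.inf), (1, np.inf)]:
    assert lp(sv, b) <= lp(sv, a) <= k ** (1 / a - 1 / b) * lp(sv, b) + 1e-9
    assert lp(ev, b) <= lp(ev, a) <= (m * n) ** (1 / a - 1 / b) * lp(ev, b) + 1e-9
```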
Normalized Gradient Descent
We consider the variant of Normalized Gradient Descent (NGD) that divides the gradient by its Frobenius norm at every instant. That is,
\[ \frac{dX_t}{dt} = -\eta \frac{G_t}{\lVert G_t \rVert _F} \]
Let’s try to identify the geometry of the optimizer. The dual norm is \(\lVert G_t \rVert _\gamma = \langle G_t, \frac{G_t}{\lVert G_t \rVert _F} \rangle_F = \lVert G_t \rVert _F\). Since the Frobenius norm is self-dual, the optimizer’s defining norm is also Frobenius.
Case 1: Schatten-p Norm Loss
Here \(\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p}\) and its dual is \(\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}\). Since both \(\gamma\) and \(\delta\) are Schatten norms, we use the direct analysis.
We need to bound \(C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_F\). Using the direct bound: \(\lambda_{\min}(A)\lVert S \rVert_F \le \lVert AS \rVert_F \le \lambda_{\max}(A)\lVert S \rVert_F\). This gives \(C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F\) and \(C_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F\). We use Lemma 4.2 to find the inf/sup of \(\lVert S \rVert_F = \lVert S \rVert_{S_2}\) given \(\lVert S \rVert_{S_q}=1\).
If \(p \ge 2\) (so \(1 \le q \le 2\)): We compare norms \(S_q\) and \(S_2\). Here \(a=q, b=2\). \(k^{\frac{1}{2}-\frac{1}{q}} \lVert S \rVert_{S_q} \le \lVert S \rVert_{S_2} \le \lVert S \rVert_{S_q}\). With \(\lVert S \rVert_{S_q}=1\), we have \(\inf \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}}\) and \(\sup \lVert S \rVert_F = 1\). \(C_{\text{lower}} \ge \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}}\) and \(C_{\text{upper}} \le \lambda_{\max}(A)\).
If \(p \le 2\) (so \(q \ge 2\)): We compare norms \(S_2\) and \(S_q\). Here \(a=2, b=q\). \(\lVert S \rVert_{S_q} \le \lVert S \rVert_{S_2} \le k^{\frac{1}{2}-\frac{1}{q}} \lVert S \rVert_{S_q}\). With \(\lVert S \rVert_{S_q}=1\), we have \(\inf \lVert S \rVert_F = 1\) and \(\sup \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}}\). \(C_{\text{lower}} \ge \lambda_{\min}(A)\) and \(C_{\text{upper}} \le \lambda_{\max}(A) k^{\frac{1}{2}-\frac{1}{q}}\).
Corollary 4.4. For NGD with Schatten-\(p\) loss:
If \(p \ge 2\) (\(1 \le q \le 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}}} \]
If \(p \le 2\) (\(q \ge 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A) k^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)} \]
Case 2: Element-wise p-Norm Loss
The analysis is analogous to the Schatten case, but since the loss norm is element-wise, we use the Frobenius-bridging method. The bounds for \(\lVert S \rVert_F\) are found using Lemma 4.3, which replaces \(k\) with \(mn\).
Corollary 4.5. For NGD with element-wise $p$-norm loss:
If \(p \ge 2\) (\(1 \le q \le 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \]
If \(p \le 2\) (\(q \ge 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\max}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)} \]
Muon
Note that Muon orthogonalizes the gradient, i.e., if \(G_t = U_t \Sigma_t V_t^T\), then \(\frac{dX_t}{dt} = -\eta \cdot U_t V_t^T\). Let’s try to see whether this fits into our definition. The dual norm would be \(\lVert G_t \rVert _\gamma = \langle U_t \Sigma_t V_t^T, U_t V_t^T \rangle_F = \text{Tr}(V_t U_t^T U_t \Sigma_t V_t^T) = \text{Tr}(V_t \Sigma_t V_t^T) = \text{Tr}(\Sigma_t) = \lVert G_t \rVert _*\). So the dual norm is the Nuclear norm (\(S_1\)). Hence the defining norm is the spectral norm (\(S_\infty\)).
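The claim \(\langle G, UV^T \rangle_F = \lVert G \rVert_*\) is easy to verify numerically (a sketch of our own):

```python
import numpy as np

G = np.random.default_rng(2).standard_normal((6, 4))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
# <G, U V^T>_F = Tr(V Sigma U^T U V^T) = Tr(Sigma) = nuclear norm of G.
assert np.isclose(np.trace(G.T @ (U @ Vt)), s.sum())
```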
Case 1: Schatten-p Norm Loss
Here, the loss norm is \(\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p}\) and its dual is \(\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}\). Both the optimizer’s dual norm (\(\gamma=S_1\)) and the loss’s dual norm (\(\delta=S_q\)) are Schatten norms, so we use the direct analysis for a tighter bound.
We need to bound \(C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_{S_1}\). The direct method gives \(\lambda_{\min}(A)\lVert S \rVert_{S_1} \le \lVert AS \rVert_{S_1} \le \lambda_{\max}(A)\lVert S \rVert_{S_1}\). Thus, \(C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_{S_1}\) and \(C_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_{S_1}\). We use Lemma 4.2 to compare the \(S_1\) and \(S_q\) norms. Since \(q \ge 1\), we always have \(a=1, b=q\). The inequality is \(\lVert S \rVert_{S_q} \le \lVert S \rVert_{S_1} \le k^{1-\frac{1}{q}} \lVert S \rVert_{S_q}\). Given \(\lVert S \rVert_{S_q} = 1\), we get \(1 \le \lVert S \rVert_{S_1} \le k^{1-\frac{1}{q}}\). Therefore, \(\inf \lVert S \rVert_{S_1} = 1\) and \(\sup \lVert S \rVert_{S_1} = k^{1-\frac{1}{q}}\). This yields the bounds: \(C_{\text{lower}} \ge \lambda_{\min}(A)\) and \(C_{\text{upper}} \le \lambda_{\max}(A) k^{1-\frac{1}{q}}\). Noting that \(1-1/q = 1/p\), the factor can be written as \(k^{1/p}\).
Corollary 4.6. For Muon with Schatten-\(p\) loss, for any \(p \in [1, \infty]\), the convergence time is bounded by:
\[ \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)} \]
Case 2: Element-wise p-Norm Loss
Here \(\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{e,p}\), so \(\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{e,q}\). The optimizer norm is \(\gamma=S_1\). This is a mixed-norm case, so we must use the Frobenius bridge.
We bound \(C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{e,q}=1} \lVert AS \rVert_{S_1}\). We use the inequalities \(\lVert Z \rVert_F \le \lVert Z \rVert_{S_1} \le \sqrt{k} \lVert Z \rVert_F\). \(C_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf \lVert S \rVert_F\). \(C_{\text{upper}} \le \sup \sqrt{k} \lVert AS \rVert_F \le \sqrt{k} \lambda_{\max}(A) \sup \lVert S \rVert_F\). The infimum and supremum are over the set \(\{S \mid \lVert S \rVert_{e,q}=1\}\), and are found using Lemma 4.3 relating \(\lVert \cdot \rVert_{e,q}\) and \(\lVert \cdot \rVert_F = \lVert \cdot \rVert_{e,2}\).
Corollary 4.7. For Muon with element-wise $p$-norm loss:
If \(p \ge 2\) (\(1 \le q \le 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{k} \lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \]
If \(p \le 2\) (\(q \ge 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{k} \lambda_{\max}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)} \]
Adam/SignSGD
As usual, let’s try to see whether this algorithm fits in our formulation. The update is \(\frac{dX_t}{dt} = -\eta \cdot \text{sign}(G_t)\). The dual norm would be \(\lVert G_t \rVert _\gamma = \langle G_t, \text{sign}(G_t) \rangle _F = \text{Tr}(G_t^T \text{sign}(G_t)) = \lVert G_t \rVert_{e, 1}\), the element-wise \(L_1\) norm. The geometry is thus induced by the element-wise \(L_\infty\) norm, \(\lVert \cdot \rVert_{e, \infty}\).
Remark 4.8. SignSGD can be viewed as a limiting case of the Adam optimizer in which both moment estimates track the current gradient and its element-wise square exactly (and \(\epsilon = 0\)).
Case 1: Schatten-p Norm Loss
Here \(\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{S_p}\) and \(\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{S_q}\). This is a mixed-norm case.
We need to bound \(C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{S_q}=1} \lVert AS \rVert_{e,1}\). For the lower bound, we use the Frobenius bridge: \(C_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F\). For the upper bound, we route through the nuclear norm, which gives a single expression valid for all \(p\). We use the chain of inequalities \(\lVert Z \rVert_{e,1} \le \sqrt{mn}\, \lVert Z \rVert_F\) and \(\lVert Z \rVert_F \le \lVert Z \rVert_*\), so \(\lVert Z \rVert_{e,1} \le \sqrt{mn}\, \lVert Z \rVert_*\). \(C_{\text{upper}} = \sup \lVert AS \rVert_{e,1} \le \sup \sqrt{mn}\,\lVert AS \rVert_* \le \sqrt{mn}\, \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_*\). We need the inf/sup of \(\lVert S \rVert_F\) and \(\lVert S \rVert_*\) over the $S_q$-sphere. From Lemma 4.2: \(\sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_* = \sup \lVert S \rVert_{S_1} = k^{1-1/q} = k^{1/p}\). This gives \(C_{\text{upper}} \le \sqrt{mn}\, \lambda_{\max}(A)\, k^{1/p}\) for all \(p\).
The bounds for \(\inf \lVert S \rVert_F\) depend on \(q\)’s relation to 2, as derived for NGD.
If \(p \ge 2\) (\(1 \le q \le 2\)): \(\inf \lVert S \rVert_F = k^{\frac{1}{2}-\frac{1}{q}}\).
If \(p \le 2\) (\(q \ge 2\)): \(\inf \lVert S \rVert_F = 1\).
Corollary 4.9. For Adam/SignSGD with Schatten-\(p\) loss:
If \(p \ge 2\) (\(1 \le q \le 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \sqrt{mn}\, \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A) k^{\frac{1}{2}-\frac{1}{q}}} \]
If \(p \le 2\) (\(q \ge 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \sqrt{mn}\, \lambda_{\max}(A) k^{1/p}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{S_p}}{\eta \lambda_{\min}(A)} \]
Case 2: Element-wise p-Norm Loss
Here \(\lVert \cdot \rVert_\alpha = \lVert \cdot \rVert_{e,p}\), so \(\lVert \cdot \rVert_\delta = \lVert \cdot \rVert_{e,q}\).
We need to bound \(C_{\text{lower/upper}} = \inf/\sup_{\lVert S\rVert _{e,q}=1} \lVert AS \rVert_{e,1}\). For the lower bound: \(C_{\text{lower}} \ge \inf \lVert AS \rVert_F \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\). For the upper bound, we again use the Frobenius bridge, via \(\lVert Z \rVert_{e,1} \le \sqrt{mn}\, \lVert Z \rVert_F\) (Lemma 4.3 with \(a=1, b=2\)): \(C_{\text{upper}} \le \sup \sqrt{mn}\, \lVert AS \rVert_F \le \sqrt{mn}\, \lambda_{\max}(A) \sup_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\). The required inf/sup of \(\lVert S \rVert_F = \lVert S \rVert_{e,2}\) over the $e,q$-sphere come from Lemma 4.3.
If \(p \ge 2\) (\(1 \le q \le 2\)): \(\inf \lVert S \rVert_F = (mn)^{\frac{1}{2}-\frac{1}{q}}\) and \(\sup \lVert S \rVert_F = 1\).
If \(p \le 2\) (\(q \ge 2\)): \(\inf \lVert S \rVert_F = 1\) and \(\sup \lVert S \rVert_F = (mn)^{\frac{1}{2}-\frac{1}{q}}\).
Corollary 4.10. For Adam/SignSGD with element-wise $p$-norm loss:
If \(p \ge 2\) (\(1 \le q \le 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{mn}\,\lambda_{\max}(A)} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A) (mn)^{\frac{1}{2}-\frac{1}{q}}} \]
If \(p \le 2\) (\(q \ge 2\)):
\[ \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \sqrt{mn}\,\lambda_{\max}(A)(mn)^{\frac{1}{2}-\frac{1}{q}}} \le T_{\text{final}} \le \frac{\lVert AX_0 - B \rVert_{e,p}}{\eta \lambda_{\min}(A)} \]
A Note on Bound Tightness and Rate Variation
This note analyzes two key properties of the derived bounds: the potential variation in the convergence rate and the conditions under which the bounds are tight.
Analysis of Rate Variation
The ratio \(C_{\text{upper}} / C_{\text{lower}}\) measures the stability of the convergence rate. A value near one implies a nearly constant rate, while a large value indicates that the rate can vary substantially, both along a single trajectory and across different starting points. Let’s analyze this for the direct Schatten-Schatten analysis, for example, Muon (\(\gamma=S_1\)) with Schatten loss (\(\delta=S_q\)).
\[ \frac{C_{\text{upper}}}{C_{\text{lower}}} \le \kappa(A)\, k^{1/p} \]
In this case, the rate variation is controlled by the conditioning of \(A\) and a dimensional factor related to the loss geometry. For the more complex bridged analyses, the ratio also picks up factors from the norm equivalences used. The general structure holds: rate variation is bounded by a product of contributions from the problem’s conditioning (\(\kappa(A)\)), the optimizer’s geometry, and the loss function’s geometry. In general, the closer the optimizer’s geometry is to the problem’s, the better the convergence guarantees will be; after all, we are performing steepest descent with respect to some matrix norm.
Let’s compare this to the corresponding bound for the Adam case:
\[ \frac{C_{\text{upper}}}{C_{\text{lower}}} \le \kappa(A)\, (mn)^{1/2}\, k^{1/p + 1/\min(2, q) - 1/2} \]
For any Schatten-p norm, we see that Muon gives better convergence guarantees (according to our analysis) as compared to Adam, though it should be kept in mind that the bounds for Adam are not as tight.
Analysis of Bound Tightness
We now investigate whether our bounds on \(C_{\text{lower}}\) and \(C_{\text{upper}}\) are achievable.
Case 1: Schatten-Schatten Analysis (Direct)
This case applies when both the optimizer’s dual norm (\(\gamma\)) and the loss’s dual norm (\(\delta\)) are Schatten norms (e.g., Muon with Schatten-p loss). The analysis uses the direct inequality \(\lambda_{\min}(A)\lVert S \rVert_{\gamma} \le \lVert AS \rVert_{\gamma} \le \lambda_{\max}(A)\lVert S \rVert_{\gamma}\) followed by a norm equivalence bound.
Lower Bound (\(C_{\text{lower}}\)): Our bound is \(C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{\delta}=1} \lVert S \rVert_\gamma\). To achieve this bound, a single matrix \(S\) must satisfy two conditions simultaneously:
- \(\lVert AS \rVert_\gamma = \lambda_{\min}(A) \lVert S \rVert_\gamma\): This is achieved if \(S\) lies in an eigenspace of \(A\) associated with the minimum eigenvalue, \(\lambda_{\min}(A)\). For example, if \(S=uv^T\) where \(u,v\) are eigenvectors for \(\lambda_{\min}(A)\).
- \(S\) must be a matrix that achieves the infimum of \(\lVert S \rVert_\gamma\) over the unit sphere of \(\lVert \cdot \rVert_\delta\). For Muon (\(\gamma=S_1\)) and any Schatten loss (\(\delta=S_q\)), this infimum is 1 and is achieved by any rank-one matrix.
A rank-one matrix \(S=uv^T\) constructed from eigenvectors of \(\lambda_{\min}(A)\) satisfies both conditions. Therefore, the lower bound is tight.
Upper Bound (\(C_{\text{upper}}\)): Our bound is \(C_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{\delta}=1} \lVert S \rVert_\gamma\). This requires a matrix \(S\) to satisfy:
- \(\lVert AS \rVert_\gamma = \lambda_{\max}(A) \lVert S \rVert_\gamma\): This requires \(S\) to lie in the eigenspace of \(\lambda_{\max}(A)\).
- \(S\) must achieve the supremum of \(\lVert S \rVert_\gamma\) over the unit sphere of \(\lVert \cdot \rVert_\delta\). For Muon (\(\gamma=S_1\), \(\delta=S_q\)), this supremum is \(k^{1/p}\) and is achieved by a matrix whose non-zero singular values are all equal (a “spectrally flat” (or whitened) matrix).
This is achievable if the eigenspace of \(\lambda_{\max}(A)\) is large enough (i.e., its multiplicity is at least \(k=\min(n,m)\)) to construct a spectrally flat, rank-\(k\) matrix within it. If \(\lambda_{\max}(A)\) is a simple (non-repeated) eigenvalue, this is generally not possible for \(k>1\). Thus, the upper bound is tight under certain conditions on \(A\), but may be conservative otherwise.
Case 2: Element-wise/Element-wise Analysis (Bridged)
This case applies to Adam/SignSGD with an element-wise p-norm loss. The analysis uses the Frobenius norm as a bridge.
Lower Bound (\(C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\)):
- \(\lVert AS \rVert_F = \lambda_{\min}(A) \lVert S \rVert_F\): Requires \(S\) to lie in the eigenspace of \(\lambda_{\min}(A)\).
- \(S\) must achieve \(\inf_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\). For \(q \ge 2\) this is achieved by a matrix with only one non-zero entry (an extremely sparse matrix, e.g., \(S=E_{ij}\)); for \(q \le 2\) the sparse and dense extremizers swap roles, which does not change the conclusions below.
The bound is tight only if such a basis matrix \(E_{ij}\) lies in the \(\lambda_{\min}(A)\) eigenspace, i.e., if some standard basis vector \(e_i\) is an eigenvector of \(A\) for \(\lambda_{\min}(A)\) (as happens, for instance, when \(A\) is diagonal). For general \(A\), the conditions are in conflict, making the lower bound conservative.
Upper Bound (\(C_{\text{upper}} \le \lambda_{\max}(A) \sup_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\)):
- \(\lVert AS \rVert_F = \lambda_{\max}(A) \lVert S \rVert_F\): Requires \(S\) to lie in the eigenspace of \(\lambda_{\max}(A)\).
- \(S\) must achieve \(\sup_{\lVert S\rVert _{e,q}=1} \lVert S \rVert_F\). This is achieved by a matrix whose non-zero entries all have the same absolute value (a “dense” matrix, like the all-ones matrix \(J\)).
This is tight only if a matrix of constant-magnitude entries (like \(J\)) is an eigenvector of \(A\). This is true for highly structured matrices (e.g., \(A=c_1I + c_2J\)) but not for general \(A\). The upper bound is therefore also conservative.
Case 3: Mixed-Norm Analysis (Bridged)
This case applies to Adam/SignSGD with Schatten loss or Muon with element-wise loss. The analysis involves inequalities bridging between Schatten and element-wise norms (e.g., \(\lVert Z \rVert_{e,1} \le \sqrt{k} \lVert Z \rVert_*\) or \(\lVert Z \rVert_F \le \lVert Z \rVert_*\)).
Lower Bound (\(C_{\text{lower}}\)): Consider Adam/SignSGD with Schatten loss, where \(C_{\text{lower}} \ge \lambda_{\min}(A) \inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F\).
- \(\lVert AS \rVert_F = \lambda_{\min}(A) \lVert S \rVert_F\): Requires \(S\) to be in the \(\lambda_{\min}(A)\) eigenspace.
- \(S\) must achieve \(\inf_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_F\). This requires \(S\) to be rank-one (if \(q \ge 2\)) or spectrally flat (if \(q < 2\)).
For \(q \ge 2\), we require a rank-one matrix built from the \(\lambda_{\min}(A)\) eigenspace, which is always possible, so for this sub-case the bound is tight. For \(q < 2\), it requires a spectrally flat matrix within that eigenspace, which is only possible if \(\lambda_{\min}(A)\) has multiplicity at least \(k\).
Upper Bound (\(C_{\text{upper}}\)): Consider Adam/SignSGD again, with the bound \(C_{\text{upper}} \le \sqrt{mn}\, \lambda_{\max}(A) \sup_{\lVert S\rVert _{S_q}=1} \lVert S \rVert_*\).
- \(\lVert AS \rVert_{e,1} \le \sqrt{mn}\, \lVert AS \rVert_*\): The first bridging step. Equality requires \(AS\) to be a rank-one matrix whose entries all have equal absolute magnitude.
- \(\lVert AS \rVert_* \le \lambda_{\max}(A) \lVert S \rVert_*\): The spectral bound. Equality requires \(S\) to be in the \(\lambda_{\max}(A)\) eigenspace.
This requires that a matrix \(S\) from the \(\lambda_{\max}(A)\) eigenspace produces a matrix \(AS\) that is both rank-one and has entries of equal magnitude. These are generally conflicting geometric requirements. If \(S\) is rank-one, \(AS\) will be too, but requiring its entries to have equal magnitude is a very strong condition on \(A\). This bound is therefore typically very conservative.
In summary, the direct analysis for matching Schatten norms yields tight or near-tight bounds. In contrast, any analysis that must bridge between element-wise and Schatten “geometries” introduces conflicting requirements, leading to bounds that are more likely to be conservative estimates.
Using This In Practice: A Comparison of Optimizers
The theoretical framework developed provides another lens through which to compare familiar optimization algorithms. By understanding their underlying geometry, we can move beyond empirical performance and attempt to reason about why they behave differently.
Let us reiterate the warning from the introduction: intuition built from toy problems can only go so far.
The Problem with Vanilla Gradient Descent
Standard Gradient Descent (GD) does not belong to the family of “geometric optimizers” analyzed here. Its continuous-time flow is given by:
\[ \frac{dX_t}{dt} = -\eta G_t \]
Notice the absence of the \(\text{dualizer}\) map: the update magnitude is directly proportional to the norm of the gradient, \(\lVert G_t \rVert\). Consider the classic smooth case where GD is run on the squared Frobenius loss, \(\tfrac{1}{2}\lVert AX - B \rVert_F^2\), whose gradient is \(G_t = A(AX_t - B)\). As the optimizer approaches the solution \(X^*\), the residual \(AX_t - B\) shrinks, the gradient norm approaches zero, and convergence slows down ever further. The flow becomes a linear ODE, whose exact solution is:
\[ X_t - X^* = e^{-A^2 \eta t} (X_0 - X^*) \]
This explicitly shows that the error term decays exponentially but only vanishes as \(t \to \infty\). Standard GD offers infinite-time asymptotic convergence.
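Since \(A\) is symmetric, the matrix exponential can be evaluated through an eigendecomposition; the sketch below (our own) evaluates the closed-form solution and shows the residual decaying exponentially without ever reaching zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD
B = rng.standard_normal((n, m))
X0 = np.zeros((n, m))
X_star = np.linalg.solve(A, B)

eta = 1.0
lam, Q = np.linalg.eigh(A)
for t in [0.0, 0.02, 0.05, 0.1, 0.2]:
    E = Q @ np.diag(np.exp(-eta * t * lam ** 2)) @ Q.T   # e^{-eta t A^2}
    Xt = X_star + E @ (X0 - X_star)
    print(t, np.linalg.norm(A @ Xt - B, "fro"))          # exponential decay, never exactly 0
```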
Geometric Optimizers as the Solution
The geometric optimizers studied in this post (NGD, Muon, and Adam/SignSGD) all “fix” this problem of vanishing updates. They achieve this by normalizing the gradient, ensuring the update has a constant size with respect to the chosen norm: the update rule \(\frac{dX_t}{dt} = -\eta \cdot \text{dualizer}_{\lVert \cdot \rVert_\beta}(G_t)\) satisfies \(\lVert \frac{dX_t}{dt} \rVert_\beta = \eta\), a constant. This fixed-size progress in the chosen norm forces the optimizer to reach the zero-loss solution in a finite amount of time. The flip side is that, even in an exceptionally simple loss landscape like ours, the discrete-step versions of these methods will oscillate around the solution instead of settling on it exactly, because the step size never shrinks.
Comparing the structure for Normalized Gradient Descent, Muon, and Adam/SignSGD
While all three ensure finite-time convergence, their performance characteristics, revealed by the convergence bounds, are quite different because they are associated with different norms.
Normalized Gradient Descent (NGD): This optimizer uses a Euclidean norm (\(\beta=F\)). It normalizes the entire gradient matrix \(G_t\), scaling it down to have a Frobenius norm of 1. It treats all entries and all singular values of the gradient democratically. It is a good baseline but may be blind to more nuanced structures in the problem.
Muon: This optimizer uses the spectral norm (\(\beta=S_\infty\)). If the gradient has an SVD of \(G_t = U\Sigma V^T\), the update direction is the matrix \(-UV^T\). This update is an isometry that encapsulates the “pure orientation” of the gradient, stripped of all singular value scaling. It’s a whitened update that respects the singular vector spaces of the gradient. This makes it highly sensitive to the spectral properties of the error, treating every singular direction present in the gradient with equal importance.
SignSGD and Adam: The SignSGD optimizer uses an element-wise norm (\(\beta=e,\infty\), so \(\gamma=e,1\)). Its dualizer is the element-wise sign function, so every entry of the update matrix is either \(+1\) or \(-1\) (scaled by \(\eta\)). This is closely related to the popular Adam optimizer. In Adam, the gradient is divided element-wise by the square root of its squared moving average (\(v_t\)). If the momentum terms are ignored (or in a quasi-stationary limit where the estimates have converged), the Adam update is approximately \(\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) \approx \text{sign}(G_t)\). Our analysis of SignSGD can therefore be seen as a theoretical characterization of an idealized Adam with \(\epsilon=0\). This geometry considers each matrix entry independently, making it effective when errors are sparse or scattered without a clear spectral pattern.
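To make the quasi-stationary limit concrete, here is a minimal sketch (our own; bias correction is omitted since it does not affect the limit) showing that when the gradient is held fixed, the Adam direction tends to \(\text{sign}(G)\):

```python
import numpy as np

def adam_direction(g, m, v, beta1=0.9, beta2=0.999, eps=0.0):
    """One (bias-uncorrected) Adam update direction; minimal sketch."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    return m / (np.sqrt(v) + eps), m, v

g = np.random.default_rng(4).standard_normal((4, 3))
m, v = np.zeros_like(g), np.zeros_like(g)
for _ in range(10_000):
    d, m, v = adam_direction(g, m, v)          # gradient held fixed
print(np.allclose(d, np.sign(g), atol=1e-3))   # True: direction converges to sign(g)
```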
A final note on momentum. While we have not studied it here, momentum can be added to any of these gradient flows via a coupled differential equation. It often accelerates convergence but can also introduce oscillations, as it gives the optimizer “mass” and inertia. Such an analysis is possible, but we leave it for another time, since it is mathematically much more involved. Gabriel Goh’s article on momentum at distill.pub explains the key intuitions quite well, including how momentum can yield up to a square-root reduction in the “effective” condition number, which has important consequences for convergence.
The Nature of the Updates
The choice of the geometry dictates the very structure of the update step, \(\Delta X = -\eta \cdot \text{dualizer}(G_t)\).
Muon (with \(\beta=S_\infty\), \(\gamma=S_1\)) produces a full-rank, spectrally flat update matrix, \(UV^T\), where \(G_t=U\Sigma V^T\). This update preserves the singular vector spaces of the gradient but replaces all non-zero singular values with 1. It is a “pure rotation” (or isometry) in the direction of the error. It changes the orientation of the solution without preferentially scaling any spectral direction.
Adam/SignSGD (with \(\beta=e,\infty\), \(\gamma=e,1\)) produces a dense, full-rank update matrix whose entries are all \(\pm \eta\). It modifies every single parameter of \(X\) at every step, with the magnitude of the change being independent of the magnitude of the gradient for that entry.
NGD (with \(\beta=F\), \(\gamma=F\)) produces a full-rank update that is a scaled version of the gradient, preserving its relative structure completely.
In practice, however, the situation is different (this is also why we should take all intuition from toy problems with a grain of salt). The gradients \(G_t\) produced by standard methods like SGD or Adam are often empirically ill-conditioned or “spectrally spiky”: their singular value spectrum is dominated by a few very large values, with a long tail of small ones, making the gradient matrix effectively low-rank. Adam/SignSGD bets on element-wise whitening, which doesn’t work as well here. Muon explicitly performs spectral whitening (i.e., orthogonalization), which discards the singular values and updates along all singular vectors equally; the idea is to boost the contribution of rare but nevertheless important directions.
Also note that, since with weight decay the weights themselves are effectively EMAs of the (negative) updates, spectrally flat updates might also lead to more effective use of the network’s capacity when the weight decay is fast.
Building Intuition from the Convergence Bounds
The bounds derived in Theorem 3.1 and instantiated in Section 4 can provide some intuition about how to choose and reason about optimizers.
The Meaning of \(C_{\text{lower}}\) and \(C_{\text{upper}}\)
The main result states that \(T_{\text{final}} \approx L(X_0) / (\eta C)\). \(C\) can be thought of as a speed. Specifically, \(C = \lVert AS \rVert_\gamma\) measures how “large” the gradient step is in the optimizer’s geometry (\(\gamma\)), given a unit-sized error “direction” \(S\) in the loss’s geometry (\(\delta\)).
The constants \(C_{\text{lower}}\) and \(C_{\text{upper}}\) represent the worst-case and best-case scenarios for this rate.
- \(C_{\text{lower}}\) is the slowest possible rate of convergence. A large \(C_{\text{lower}}\) gives a strong worst-case guarantee.
- \(C_{\text{upper}}\) is the fastest possible rate. It tells you the best you can hope for.
Rate Stability and Comparing Optimizers
The ratio \(C_{\text{upper}} / C_{\text{lower}}\) is a measure of the stability or robustness of the convergence rate.
- A ratio close to 1 means the optimizer makes steady, predictable progress regardless of the specific error structure at any given time, or of the starting point.
- A large ratio implies the convergence can be highly variable: sometimes very fast, sometimes very slow. This happens when the optimizer’s geometry is poorly matched to certain error structures.
When comparing two optimizers, loosely speaking, one can normalize the learning rate \(\eta\) so that their “update sizes” match (e.g., make \(\eta C_{\text{upper}}\), some averaged \(\eta C\), or some norm of the update equal for both). The comparison then boils down to which one has a larger \(C_{\text{lower}}\). The optimizer with the better worst-case guarantee (larger \(C_{\text{lower}}\)) is often preferable, as it is more reliable. For instance, our analysis for Schatten loss shows that the rate variation for Muon is at most \(\kappa(A)k^{1/p}\), while the corresponding (though potentially looser) bound for Adam/SignSGD includes additional factors, suggesting Muon’s convergence rate might be more stable (or at least has better provable properties). In the case of \(p = 2\), the Adam/SignSGD bounds can actually be attained for suitably structured \(A\) (the Frobenius norm is both a Schatten and an element-wise norm, so the achieving matrices from the element-wise analysis apply), which suggests Muon really is the better choice in this case. We can make this somewhat precise: after normalizing the learning rate by functions of the matrix size, so that the Frobenius norm of the updates is roughly constant on average (over all possible starting points \(X\)), the lower bounds on the convergence time become constant for both optimizers, while the upper bounds grow at different rates, so Muon’s worst case is better than Adam’s.
Also, ideally, if you know the geometry of the problem, the choice of optimizer should be derived directly from it. If not, it may be better to choose a geometry that, in some sense, never deviates too much from the problem’s actual geometry at any given point. By this criterion, the spectral norm (and hence Muon) generally seems to be a good choice.
Further reading
On the topic of metrized deep learning, a good introduction is the survey here; I would highly recommend doing a breadth-first search through the references mentioned there to understand the main ideas better.
After I wrote this post, someone pointed out this paper, which also studies Muon’s convergence to local minima (with discrete steps and non-convex problems, though not necessarily in finite time) and compares it to gradient descent.