Theoretical properties of optimizers on a toy problem, and some intuition
Table of Contents

- Introduction
- The Setting
  - The Optimization Problem
  - Subgradient of the Loss Function
- The Family of Geometric Optimizers
- A Convergence Theorem
- Convergence Bounds for Specific Scenarios
  - Normalized Gradient Descent
  - Muon
  - Adam/SignSGD
- A Note on Bound Tightness and Rate Variation
  - Analysis of Rate Variation
  - Analysis of Bound Tightness
- Using This In Practice: A Comparison of Optimizers
  - The Problem with Vanilla Gradient Descent
  - Geometric Optimizers as the Solution
  - Comparing the structure for Normalized Gradient Descent, Muon, and Adam/SignSGD
  - The Nature of the Updates
  - Building Intuition from the Convergence Bounds
  - The Meaning of \(C_{\text{lower}}\) and \(C_{\text{upper}}\)
  - Rate Stability and Comparing Optimizers
- Further reading

Introduction

We want to answer the question: how do optimizers fundamentally differ in their approach to finding a minimum? To explore this question in a controlled environment, we focus on a simple problem: minimizing the matrix error term \(\min \lVert AX - B \rVert\). ...
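To make the setup concrete, here is a minimal NumPy sketch of this toy objective and one valid subgradient. The choice of the Frobenius norm, the dimensions, and the synthetic data are illustrative assumptions, not details taken from the analysis that follows.

```python
import numpy as np

# Toy problem: minimize L(X) = ||A X - B|| over X.
# Frobenius norm and all sizes/data below are assumptions for illustration.
rng = np.random.default_rng(0)
m, n, k = 20, 10, 5
A = rng.standard_normal((m, n))
X_true = rng.standard_normal((n, k))
B = A @ X_true                      # constructed so the minimum loss is exactly 0

def loss(X):
    """Unsquared Frobenius-norm error ||A X - B||_F."""
    return np.linalg.norm(A @ X - B, ord="fro")

def subgradient(X):
    """A subgradient of ||A X - B||_F with respect to X."""
    R = A @ X - B
    r = np.linalg.norm(R, ord="fro")
    if r == 0.0:                    # at the minimizer, 0 is a valid subgradient
        return np.zeros_like(X)
    return A.T @ R / r              # gradient of the unsquared norm away from 0

X = np.zeros((n, k))
print(f"initial loss: {loss(X):.4f}, "
      f"subgradient norm: {np.linalg.norm(subgradient(X)):.4f}")
```

The different optimizers discussed later all start from a (sub)gradient like this one and differ in how they turn it into an update direction.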