Theoretical properties of optimizers on a toy problem, and some intuition

We want to answer the question: how do optimizers fundamentally differ in their approach to finding a minimum? To explore this question in a controlled environment, we focus on a simple problem: minimizing the matrix error term \(\min \lVert AX - B \rVert\). ...
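
As a taste of the setting, here is a minimal numpy sketch (my own illustration, not code from the post) of three update rules on this objective, assuming the Frobenius norm; the function names and step size are arbitrary, and the SVD stands in for the Newton-Schulz orthogonalization that Muon actually uses.

```python
import numpy as np

# Sketch of three "geometric" update rules on f(X) = ||AX - B||_F.
# Assumes the Frobenius norm; names and step size are illustrative.
def subgradient(A, X, B):
    R = A @ X - B
    nrm = np.linalg.norm(R)
    if nrm == 0:
        return np.zeros_like(X)  # 0 is a valid subgradient at the minimum
    return A.T @ R / nrm         # gradient of ||AX - B||_F away from zero

def step(X, G, rule, lr=0.1):
    if rule == "normalized_gd":   # move along the unit-norm gradient direction
        return X - lr * G / np.linalg.norm(G)
    if rule == "muon_like":       # orthogonalize the gradient: G -> U V^T
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return X - lr * U @ Vt
    if rule == "signsgd":         # elementwise sign, the Adam-like limit
        return X - lr * np.sign(G)
    raise ValueError(rule)
```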

August 2, 2025 · 33 min · 6841 words · nor

Deriving RoPE the proper way

[Figure 1: Attention score similarity with our 3D RoPE. These plots show how positional similarity changes across an axis.] RoPE has become the de facto positional embedding for transformer models. Its popularity mainly stems from its performance, but the “derivation” in the paper is also quite elegant (if flawed). Implementing high-dimensional RoPE also pushes us to think about generalizing the underlying ideas as much as possible (alongside using signal-processing intuition); there’s code at the end of the post that implements things based on the ideas we develop here. ...
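
For readers who haven’t seen it, this is the standard 1D RoPE in a few lines of numpy: a sketch of the textbook formulation (base 10000 as in the original paper), not the 3D construction developed in the post.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE sketch: rotate consecutive feature pairs of x
    (last dim d, d even) by angles pos * base**(-2i/d)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2x2 rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Key property: the dot product of rope_1d(q, m) and rope_1d(k, n)
# depends on the positions only through the difference m - n.
```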

July 28, 2025 · 20 min · 4050 words · nor

Solving the IMO 2025 problems

So I’ve been solving IMO problems almost every year, ever since I was a math Olympiad contestant, either posting them to AoPS or just keeping the solutions to myself/discussing them with other math Olympiad people. Usually, I would solve the problems in a single sitting, but given that I missed a couple of years at some point and that I had a bit of time, I decided to do a proper IMO mock this time (spoiler: P4 left a bad enough taste that I didn’t really want to put effort into P6). ...

July 19, 2025 · 15 min · 3004 words · nor

Quantizing LLMs for inference

Let’s start by doing some arithmetic about large language models (LLMs). These are neural networks with huge parameter counts: state-of-the-art open-weights models (i.e., ones you can download) have on the order of 100B (\(10^{11}\)) parameters, with usable ones around one order of magnitude smaller. Take the latest SOTA release Qwen 3 235B-A22B, for instance, which has roughly 235B parameters. If all these parameters were stored in a naive array of 32-bit (4-byte) floating-point numbers, the model would require around 940 GB of storage, and just as much memory to run at a usable speed. Running it purely on CPU with dual-channel DDR4 RAM (which is likely the kind of RAM in your computer) would take multiple seconds to output a single token/word (and even this is quite fast for the total size of the model, because the architecture is what is called a Mixture of Experts; more on that later, so don’t worry yet). ...
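
As a sanity check on that arithmetic, here is a back-of-the-envelope sketch (my own, using the parameter count from above); it counts weights only, ignoring activations, the KV cache, and quantization metadata.

```python
# Weight-memory math only: ignores activations, KV cache, and
# per-group quantization overhead.
def weight_gigabytes(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9

n = 235e9  # Qwen 3 235B-A22B total parameter count
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_gigabytes(n, bits):.0f} GB")
# 32-bit: ~940 GB, 16-bit: ~470 GB, 8-bit: ~235 GB, 4-bit: ~118 GB
```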

May 14, 2025 · 31 min · 6410 words · nor

A Math Academy review

Some background on me, just so you have a rough idea of where this review is coming from: I did Olympiads as a kid and have been involved in fairly math-heavy fields ever since, through university at least, depending on your definition of math-heavy. I’ve also been completely self-taught when it comes to the non-elementary-school math I know. So my plan for my Math Academy subscription was mostly to brush up on things. I also did only the university-level courses. ...

April 16, 2025 · 10 min · 1948 words · nor

The intuition and the math behind Simpson's paradox

When we reason qualitatively, we often rely on our intuition. However, intuition is often loaded with certain meta-biases that cloud our judgment; one of these biases comes into play when we think about “local” and “global” statements. What is a local or a global statement? One way of distinguishing between them is by how many conditions we must guarantee hold before we can talk about the statement, so this terminology is relative. A local statement is one that uses many more conditions (a greater degree of specificity) than a global statement. For example, any statement about a certain group of people in the past few years is local relative to a statement about all species that have ever existed on Earth. ...
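
To see the local/global tension concretely, here is a tiny constructed instance of the paradox (the numbers are made up purely for illustration): treatment A wins inside every group, yet loses in aggregate.

```python
# Made-up numbers: (successes, trials) per treatment per group.
groups = {
    "group X": {"A": (8, 10),   "B": (70, 100)},  # A: 80% > B: 70%
    "group Y": {"A": (20, 100), "B": (1, 10)},    # A: 20% > B: 10%
}

totals = {"A": [0, 0], "B": [0, 0]}
for name, g in groups.items():
    for t, (s, n) in g.items():
        totals[t][0] += s
        totals[t][1] += n
    print(name, {t: f"{s}/{n} = {s / n:.0%}" for t, (s, n) in g.items()})

# A wins locally in both groups, but globally: A = 25%, B = 65%.
print("overall", {t: f"{s}/{n} = {s / n:.0%}" for t, (s, n) in totals.items()})
```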

October 2, 2024 · 12 min · 2492 words · nor

Implementing FFT

The other day I had a discussion with someone about how to implement FFT-based convolution/polynomial multiplication - they were having a hard time squeezing their library implementation into the time limit on this problem, and it soon turned into a discussion on how to optimize it as much as possible. It turned out that the bit-reversing part of their iterative implementation was taking a pretty large amount of time, so I suggested not using bit-reversal at all, as is done in a few libraries. Since not a lot of people turned out to be familiar with it, I decided to write a post on some ways of implementing FFT and deriving these ways from one another. ...
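
The bit-reversal-free idea referred to here is the standard pairing of the two Cooley-Tukey orderings: a decimation-in-frequency transform takes natural-order input to bit-reversed-order output, pointwise multiplication doesn’t care about the order, and a decimation-in-time inverse transform takes the bit-reversed product back to natural order. A minimal (unoptimized) Python sketch of that idea, not the library code from the discussion:

```python
import cmath

def fft_dif(a):
    """Decimation in frequency: natural-order input, bit-reversed-order output."""
    n = len(a)
    length = n
    while length >= 2:
        half = length // 2
        w = [cmath.exp(-2j * cmath.pi * j / length) for j in range(half)]
        for start in range(0, n, length):
            for j in range(half):
                u, v = a[start + j], a[start + j + half]
                a[start + j] = u + v
                a[start + j + half] = (u - v) * w[j]
        length //= 2

def ifft_dit(a):
    """Decimation in time with conjugate twiddles: bit-reversed-order input,
    natural-order output; unscaled (caller divides by n)."""
    n = len(a)
    length = 2
    while length <= n:
        half = length // 2
        w = [cmath.exp(2j * cmath.pi * j / length) for j in range(half)]
        for start in range(0, n, length):
            for j in range(half):
                u, v = a[start + j], a[start + j + half] * w[j]
                a[start + j] = u + v
                a[start + j + half] = u - v
        length *= 2

def convolve(f, g):
    """Integer polynomial multiplication with no bit-reversal permutation:
    both spectra come out in the same (bit-reversed) order, so pointwise
    multiplication is unaffected, and the inverse transform undoes the order."""
    n = 1
    while n < len(f) + len(g) - 1:
        n *= 2
    a = [complex(x) for x in f] + [0j] * (n - len(f))
    b = [complex(x) for x in g] + [0j] * (n - len(g))
    fft_dif(a)
    fft_dif(b)
    c = [x * y for x, y in zip(a, b)]
    ifft_dit(c)
    return [round((x / n).real) for x in c[: len(f) + len(g) - 1]]

print(convolve([1, 2, 3], [4, 5]))  # [4, 13, 22, 15]
```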

June 1, 2024 · 19 min · 4001 words · nor

An elementary way of solving recurrences

A lot of people shy away from solving (mathematical) recurrences just because the theory is not very clear/approachable without an advanced background in math. As a consequence, the usual ways of solving recurrences tend to be: find the first few terms on OEIS; guess terms from the rate of growth of the recurrence (an exponential rate of growth means you can sometimes estimate the exponential terms going from largest to smallest — though this fails in cases where there is a double root of the characteristic equation); use some theorem whose validity you can’t prove (the so-called characteristic equation method); or overkill using generating functions. But this doesn’t have to be the case, because there is a nice method you can apply to solve recurrences reliably. I independently came up with this method back in middle school, and surprisingly, a lot of people have no idea that you can solve recurrences like this. ...
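
I don’t know whether this is exactly the method in the post, but one elementary route in this spirit is to factor the recurrence into telescoping geometric pieces. For \(a_n = 5a_{n-1} - 6a_{n-2}\), for instance,

\[ a_n - 2a_{n-1} = 3(a_{n-1} - 2a_{n-2}) \implies a_n - 2a_{n-1} = 3^{n-1}(a_1 - 2a_0), \]
\[ a_n - 3a_{n-1} = 2(a_{n-1} - 3a_{n-2}) \implies a_n - 3a_{n-1} = 2^{n-1}(a_1 - 3a_0), \]

and subtracting the two equations gives the closed form \(a_n = (a_1 - 2a_0)\,3^n - (a_1 - 3a_0)\,2^n\) without invoking any unproven theorem.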

January 7, 2024 · 10 min · 1932 words · nor

The Akra-Bazzi theorem - a generalization of the master theorem for recurrences

On a computer science Discord server, someone recently asked the following question: is the master theorem applicable to the following recurrence? \(T(n) = 7T(\lfloor n / 20 \rfloor) + 2T(\lfloor n / 8 \rfloor) + n\) There was some discussion about how \(T\) is monotonically increasing (which is hard to prove), and then someone claimed that there is a solution using induction that gives a better bound. However, such ad hoc solutions often require some guessing, and the master theorem is not directly applicable here. ...
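
For reference, here is how the theorem (in its standard form) settles that recurrence: choose \(p\) such that \(7(1/20)^p + 2(1/8)^p = 1\). At \(p = 1\) the left-hand side is \(7/20 + 2/8 = 0.6 < 1\), so \(p < 1\), and the Akra-Bazzi bound with \(g(n) = n\) gives

\[ T(n) = \Theta\left(n^p \left(1 + \int_1^n \frac{u}{u^{p+1}}\, du\right)\right) = \Theta\left(n^p \cdot n^{1-p}\right) = \Theta(n). \]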

December 19, 2023 · 4 min · 669 words · nor

On lambdas, C++ and otherwise: the what, the why, and the how

The contents are as follows: Introduction; Why you should use lambdas; Some context; Hand-rolling our own lambdas; C++ lambda syntax explained; Using lambdas; Using lambdas with the STL; Some useful non-trivial patterns; Some other interesting patterns; Examples of competitive programming code using C++ lambdas. Prerequisites: knowing a bit about structs/classes in C++ and member functions, and knowing that the STL exists. If something feels unclear, I recommend waiting for a bit: I tried to ensure that all the important ideas are explained at some point or another, and if something you don’t understand doesn’t pop up later, it is probably not that important (and should not harm the experience of reading this post). Nevertheless, if I missed out on explaining something that looks important, please ask me in the comments — I’d be happy to answer your questions! ...

December 2, 2023 · 47 min · 9881 words · nor