Hi!

This blog has some stuff I’ve written over the years - on topics like math, algorithms, pedagogy, low-level programming, Linux, compilers and other things I like to talk about.

For any comments, reach out to me at nor [dot] xor [dot] nor at gmail [dot] com. I am also on Twitter (@norxornor), but might be less responsive there.

To subscribe via RSS, copypaste the link to the RSS page in your favourite RSS reader. Each tag also has its own RSS feed.

Simple Rules, Complex Dynamics – Part I: Foundations & Intuition

Table of Contents Introduction and how to read this post Some simple ideas and baselines - growth, local linearity and equilibrium Exponential and logistic growth Oscillators and local linearity Invariants and conservation A quick primer on bifurcations Oscillations from feedback and delay Lotka-Volterra predator-prey model SEIR model with vital dynamics Thresholds, alignment, and phase transitions Granovetter’s threshold model for collective behavior The Ising model: alignment, noise and criticality Schelling’s model of segregation Reinforcement, herding, and heavy tails Polya’s urn: reinforcement and path dependence Kirman-Folmer herding: bistability Sandpiles: self-organized criticality and avalanches Multiplicative growth and Kesten tails Swarms and distributed coordination Vicsek alignment: headings, noise, and a mean-field self-consistency Cucker-Smale: continuous-time velocity alignment and convergence Hydrodynamics and long-range order Selection dynamics Replicator dynamics Replicator-mutator dynamics Ricardian trade as selection of specializations Spatial structure: patterns from local rules Reaction-diffusion and diffusion-driven instability Traveling fronts: Fisher-KPP Other patterns Some common threads A toolbox Main objects of study Linearization and local analysis 2D phase-plane portraits Lyapunov functions and LaSalle’s invariance Bifurcations Discrete-time systems and chaos Delays: compartments vs true delay differential equations Invariants, positivity, comparison, monotone structure Networks and coupling Non-dimensionalization and scaling Stochastic dynamics and metastability Traveling fronts Contraction analysis Singular perturbations and slow-fast decompositions Koopman operator and data-driven surrogates Ergodic theory and mixing Non-smooth and hybrid dynamics (thresholds and edges) Control and feedback Numerical methods that respect structure References and Further Reading Acknowledgements Introduction and how to read this post Simple local interactions can lead to a large variety of global behavior. Many examples come from physics, biology, economics, and ML, but the machinery used to look at them remains essentially the same. Looking at things through a certain lens - asking questions such as “What’s the feedback?”, “How are things coupled?”, “Is there noise?”, “What are the timescales?” - can help intuit the macroscopic outcomes of these systems: growth, cycles, alignment and discontinuities, swarms, heavy tails, and different spatio-temporal patterns. ...

The modded nanogpt speedrun, but in JAX and on TPUs

TL;DR: New speedrun here. Details below. I generally experiment in PyTorch on my local GPUs. Recently, I received some TPU credits from the TPU Research Cloud (TRC) program to do some research, which gave me a good reason to finally learn JAX. Near about all I knew was that you could often replace numpy with jax.numpy and get code to run on accelerators, and that JAX code is supposed to be functional. This was also the first time I wrote code for TPUs. With this, I set out on a small side project: porting the modded nanoGPT speedrun to pure JAX with the goal of achieving the best possible performance. ...

Theoretical properties of optimizers on a toy problem, and some intuition

Table of Contents Introduction The Setting The Optimization Problem Subgradient of the Loss Function The Family of Geometric Optimizers A Convergence Theorem Convergence Bounds for Specific Scenarios Normalized Gradient Descent Muon Adam/SignSGD A Note on Bound Tightness and Rate Variation Analysis of Rate Variation Analysis of Bound Tightness Using This In Practice: A Comparison of Optimizers The Problem with Vanilla Gradient Descent Geometric Optimizers as the Solution Comparing the structure for Normalized Gradient Descent, Muon, and Adam/SignSGD The Nature of the Updates Building Intuition from the Convergence Bounds The Meaning of \(C_{\text{lower}}\) and \(C_{\text{upper}}\) Rate Stability and Comparing Optimizers Further reading Introduction We want to answer the question: how do optimizers fundamentally differ in their approach to finding a minimum? To explore this question in a controlled environment, we focus on a simple problem: minimizing the matrix error term \(\min \lVert AX - B \rVert\). ...

Deriving RoPE the proper way

Figure 1: Attention score similarity with our 3D RoPE. These plots show how positional similarity changes across an axis. Introduction RoPE has become the de facto positional embedding for transformer models. Its popularity mainly stems from its performance, but the “derivation” in the paper is also quite elegant (but flawed). Implementing high dimensional RoPE also pushes us to think about generalizing the underlying ideas as much as possible (alongside using signal processing intuition) - there’s code at the end of the post that implements things based on the ideas we develop here. ...

Solving the IMO 2025 problems

So I’ve been solving IMO problems almost every year, ever since I was a math Olympiad contestant, either posting them to AoPS or just keeping the solutions to myself/discussing with other math Olympiad people. Usually, I used to solve the problems in a single sitting, but given the fact that I missed a couple of years at some point, and that I had a bit of time this time, I decided to do a proper IMO mock this time (spoiler: P4 left a bad enough taste that I didn’t want to really put in effort into P6). ...

Quantizing LLMs for inference

Motivation Let’s start by doing some arithmetic about large language models (LLMs). These are neural networks with huge parameter counts, with state-of-the-art open-weights models (i.e., ones you can download) having parameter counts of the order of 100B (\(10^{11}\)) or so (and usable ones around one order of magnitude smaller). Take the latest SOTA release Qwen 3 235B-A22B, for instance, which has roughly 235B parameters. If all these parameters were to be stored in a naive array of 32-bit (4 byte) floating point numbers, this model would require around 940 GB of storage as well as memory for a usable speed. Running this model purely on CPU with dual channel DDR4 RAM (which is likely the kind of RAM you have on your computer) would take you multiple seconds to output a single token/word (and even this is quite fast for the total size of the model because the architecture is what is called a Mixture of Experts, more on that later, so don’t worry yet). ...

A Math Academy review

Background and context Some background on me just so that you have a rough idea of where this review is coming from: I did Olympiads as a kid and have been involved in fairly math-heavy fields ever since, through university at least, depending on your definition of math-heavy. I’ve also been completely self-taught when it comes to the non-elementary-school math I know. So my plan on using my Math Academy subscription was to mostly brush up on things. I also did only the university-level courses. ...

Calibrating Confidence

If you’re here for the game, visit this. Calibration Everyone knows that being overconfident can often lead you to making reckless, unnecessarily aggressive decisions, and being underconfident leads to not taking enough opportunities. This duality shows up everywhere in real life - from aspects like investing/business, where real money is at stake, to more personal matters like career progression and navigating interpersonal relationships. If we were always right about things, we could blindly believe in ourselves, and overconfidence would not exist. If we were completely clueless (for some definition of completely clueless), we would be better off asking a stochastic parrot to take our decisions for us. ...

LLMs and dQw4w9WgXcQ

If you have been on the internet for a while, you probably know by now that LLMs (large language models) are a thing, and talking to them feels pretty close to talking to a human. If you have been on the internet for a decade or so, you can probably guess what this blog is going to be about. A harmless query… or was it? The task at hand was downloading an interesting video from YouTube that I happened to come across in my old link archive using the ever-handy yt-dlp. ...

The intuition and the math behind Simpson's paradox

Introduction When we reason qualitatively, we often rely on our intuition. However, intuition is often loaded with certain meta-biases that cloud our judgment; one of these biases comes into play when we think about “local” and “global” statements. What is a local or a global statement? One way of distinguishing between them is how many conditions we must guarantee hold before we can talk about the statement - so this terminology is relative. A local statement is a statement that uses many more conditions (a higher amount of specificity) than a global statement. For example, any statement about a certain group of people in the past few years is a local statement relative to a statement that is about all species that have ever existed on the earth. ...