Hi!

This blog has some stuff I've written over the years - on topics like ML, math, algorithms, pedagogy, low-level programming, Linux, compilers and other things I like to talk about.

For any comments, reach out to me at nor [dot] xor [dot] nor at gmail [dot] com. I am also on Twitter (@norxornor), but might be less responsive there.

To subscribe via RSS, copy and paste the link to the RSS page into your favourite RSS reader. Each tag also has its own RSS feed.

Some of my (mostly throwaway) code can be found at nor-git.pages.dev.

Signed

Why wide neural networks share the same loss-curve fine structure

An elementary finite-time account of why wide networks trained on the same minibatches develop matching local loss fluctuations, and how width and batch size control initialization, data, and interaction noise, with implications for scaling.

Remark #2: The Adam update

A short note on Adam's update size, the coupling of beta1 and beta2, bias correction, epsilon, and how these relate to stability and warmup.

Improving the lower bound for the unit distance problem

A refinement of Sawin's explicit unit-distance lower bound, mostly by GPT 5.5 Pro, for the Erdős problem recently solved by OpenAI's internal model.

Remark #1: On RMS matched Muon

A short note on why RMS-matching Muon to AdamW can break width transfer, causing either undertraining or instability depending on scale.

A short note on some aspects of long context attention

A research note on what breaks in long-context attention, deriving a logit scaling and covering QK-norm, hybrid/local attention, gating, and small-scale experiments.

Simple Rules, Complex Dynamics – Part I: Foundations & Intuition

An intuition-building tour of dynamical systems, using canonical examples to connect feedback, thresholds, coupling, noise, reinforcement, spatial structure, and phase transitions.

The modded nanogpt speedrun, but in JAX and on TPUs

A writeup of porting the modded nanoGPT speedrun to pure JAX on TPU v6e, including hardware bottlenecks, bugs, optimizations, and open performance questions.

Theoretical properties of optimizers on a toy problem, and some intuition

A theoretical comparison of normalized gradient descent, Muon, and Adam-style updates on a tractable matrix optimization toy problem, showing finite-time convergence and why Muon's guarantees come out nicer than Adam's.

Deriving RoPE the proper way

A rigorous derivation showing RoPE is almost optimally expressive under certain natural constraints, characterizing the allowable positional rotations with a free N-dimensional generalization and constructions for the rotation vectors.

Solving the IMO 2025 problems

Proof sketches and commentary for the IMO 2025 problems, written after doing the contest as a mock.