Remark #2: The Adam update
A short note on Adam's update size, the coupling of beta1 and beta2, bias correction, epsilon, and how these relate to stability and warmup.
A refinement of Sawin's explicit unit-distance lower bound, mostly by GPT 5.5 Pro, for the Erdős problem recently solved by OpenAI's internal model.
A short note on why RMS-matching Muon to AdamW can break width transfer, causing either undertraining or instability depending on scale.
A research note on what breaks in long-context attention, deriving a logit scaling, with QK-norm, hybrid/local attention, gating, and small-scale experiments.
An intuition-building tour of dynamical systems, using canonical examples to connect feedback, thresholds, coupling, noise, reinforcement, spatial structure, and phase transitions.
A writeup of porting the modded nanoGPT speedrun to pure JAX on TPU v6e, including hardware bottlenecks, bugs, optimizations, and open performance questions.
A theoretical comparison of normalized gradient descent, Muon, and Adam-style updates on a tractable matrix optimization toy problem, showing finite-time convergence and why Muon's guarantees come out nicer than Adam's.
A rigorous derivation showing RoPE is almost optimally expressive under certain natural constraints, characterizing the allowable positional rotations with a free N-dimensional generalization and constructions for the rotation vectors.
Proof sketches and commentary for the IMO 2025 problems, written after doing the contest as a mock.
A practical overview of LLM inference quantization, from why memory bandwidth dominates local inference to GGUF, EXL, AWQ, GPTQ, KV-cache quants, and hardware tradeoffs.