machine-learning | nor's blog

Why wide neural networks share the same loss-curve fine structure

An elementary finite-time account of why wide networks trained on the same minibatches develop matching local loss fluctuations, and how width and batch size control initialization, data, and interaction noise, with implications for scaling.

Remark #2: The Adam update

A short note on Adam's update size, the coupling of beta1 and beta2, bias correction, epsilon, and how these relate to stability and warmup.

Remark #1: On RMS matched Muon

A short note on why RMS-matching Muon to AdamW can break width transfer, causing either undertraining or instability depending on scale.

A short note on some aspects of long context attention

A research note on what breaks in long-context attention: it derives a logit scaling and discusses QK-norm, hybrid/local attention, gating, and small-scale experiments.