machine-learning

Remark #2: The Adam update

A short note on Adam's update size, the coupling of beta1 and beta2, bias correction, epsilon, and how these relate to stability and warmup.

Remark #1: On RMS matched Muon

A short note on why RMS-matching Muon to AdamW can break width transfer, causing either undertraining or instability depending on scale.

A short note on some aspects of long context attention

A research note on what breaks in long-context attention: a derivation of a logit scaling, plus QK-norm, hybrid/local attention, gating, and small-scale experiments.