A short note on some aspects of long context attention

Table of Contents

- Introduction
  - Meta
  - Scoping
- What goes wrong at long context
- Training stability and QK-norm
- Looking at the distribution
  - The Gaussian assumption and \(\sqrt{2 \log n}\)
  - The Beta assumption and \(n^{2/(d-1)}\)
  - Local vs global behavior and inductive biases
- Similar existing literature
  - Scalable softmax
  - Position-dependent scaling and scale-invariant attention
  - Positional encodings and hybrid attention (local and global)
  - Attention sinks and gating
  - Revisiting QK-norm and norm information
- Experimental details
- Acknowledgements
- References
- Final notes

Introduction

Meta

One of my recent side-quests is to understand which architectural choices for models scale well, to ensure that future work that interests me remains meaningful. ...

November 27, 2025 · 59 min · 12431 words · nor