Remark #2: The Adam update

May 29, 2026 · 13 min · 2619 words · nor

TL;DR

The original Adam paper has some math issues, and the update size depends more closely on Adam's $\beta_1$ and $\beta_2$ than the paper suggests; the two are tightly coupled. We also discuss another fix to Adam's formulation that reduces the need for warmup, and briefly discuss loss spikes and $\epsilon$. These are all things that are known to people who work on optimization, but they are not often discussed among the broader community.

Introduction

Adam and AdamW have been the backbone of deep learning for a long time now, even as more advanced optimizers have started to displace them, at both small and large scales. There has been a lot of research, and there are many folklore tricks, about how to pick their hyperparameters, how these interact with the optimization process, and where instabilities arise. There is a lot of common knowledge on choosing AdamW's learning rate and weight decay, and on scaling/scheduling them (see, e.g., Loshchilov and Hutter, 2019; Yang et al., 2022; Yang et al., 2023; Everett et al., 2024; Defazio, 2025), but $\beta_1$, $\beta_2$, and $\epsilon$ do not get enough attention. Without going too much into the details, we'll look at a couple of aspects of these parameters, and point to further references. We'll ignore weight decay (so Adam and AdamW are equivalent for our purposes), because that's a whole other topic that deserves its own post/book.

Update size bounds and the coupling of $\beta_1$ and $\beta_2$

If one reads the Adam paper, it might seem like $\beta_1$ and $\beta_2$ are somewhat independent parameters. However, this is not the case, and we can obtain some clues if we look at the derivations a bit more closely.

The first issue is with the bounds on the update in Section 2. The paper says that because $|E[g]| \le \sqrt{E[g^2]}$, the update will be upper-bounded by $1$. However, this is only true when both expectations are taken with respect to the same weighting distribution over the gradient history, which is not the case for $\beta_1 \ne \beta_2$. It also considers specific cases of sparse gradients and uses them as the basis for worst-case bounds, which is incorrect. So, as they are, the paper's bounds on the update size are invalid.

In fact, there is a coupling between $\beta_1$ and $\beta_2$ for Adam. We show one instance of this coupling below: if $\beta_1^2 \ge \beta_2$, there exist cases where the update magnitude itself can blow up over time.

We can bound the update by looking at the gradient history of the optimization process. First, the update for a single parameter at step $t$ (ignoring the learning rate and $\epsilon$) looks like

$$U_t = \sqrt{\frac{1 - \beta_2^t}{1 - \beta_2}} \cdot \frac{1 - \beta_1}{1 - \beta_1^t} \cdot \frac{\sum_{i = 0}^{t-1} g_{t - i} \beta_1^i}{\sqrt{\sum_{i = 0}^{t-1} g_{t - i}^2 \beta_2^i}}$$
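As a sanity check, the closed form above can be compared against the usual Adam recurrences. The sketch below is illustrative only: the function names and the gradient sequences are made up for this example, and the learning rate and $\epsilon$ are ignored as in the text. It also evaluates a short history (one large gradient followed by small ones, with the hypothetical choice $\beta_1 = 0.9$, $\beta_2 = 0.5$) for which the update magnitude exceeds $1$.

```python
import math

def adam_update_recurrence(grads, beta1, beta2):
    """Bias-corrected m_hat / sqrt(v_hat) after consuming grads (lr and eps ignored)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    t = len(grads)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)

def adam_update_closed_form(grads, beta1, beta2):
    """The same quantity written as explicit sums over the gradient history."""
    t = len(grads)
    # grads[t - 1 - i] plays the role of g_{t-i} in the formula.
    num = sum(grads[t - 1 - i] * beta1 ** i for i in range(t))
    den = math.sqrt(sum(grads[t - 1 - i] ** 2 * beta2 ** i for i in range(t)))
    prefactor = math.sqrt((1 - beta2 ** t) / (1 - beta2)) * (1 - beta1) / (1 - beta1 ** t)
    return prefactor * num / den

# Arbitrary gradient history: both forms should agree up to rounding.
history = [0.3, -1.2, 0.7, 2.0, -0.1]
print(adam_update_recurrence(history, 0.9, 0.999), adam_update_closed_form(history, 0.9, 0.999))

# One large gradient followed by small ones, with beta1 > beta2 (assumed values).
print(adam_update_recurrence([1.0, 0.1, 0.1, 0.1, 0.1], beta1=0.9, beta2=0.5))
```

The second call is a case where the momentum still remembers the large gradient while the second-moment estimate has mostly forgotten it.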

Now, applying Cauchy-Schwarz to the sequences $\left\langle \left(\frac{\beta_1^2}{\beta_2}\right)^i \right\rangle$ and $\langle g_{t-i}^2 \beta_2^i \rangle$, and noting that we can upper-bound by the case where the gradient values $g_i$ all have the same sign, we get an achievable upper bound on the update as follows (for $\beta_1^2 \ne \beta_2$):

$$|U_t| \le \sqrt{\frac{1 - \beta_2^t}{1 - \beta_2}} \cdot \frac{1 - \beta_1}{1 - \beta_1^t} \cdot \sqrt{\frac{1 - \left(\frac{\beta_1^2}{\beta_2}\right)^t}{1 - \left(\frac{\beta_1^2}{\beta_2}\right)}}$$
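For completeness, the Cauchy-Schwarz step can be spelled out as follows. This is a reconstruction from the definitions above, and the shorthand $r = \beta_1^2 / \beta_2$ is notation introduced here for brevity:

$$\left(\sum_{i=0}^{t-1} g_{t-i} \beta_1^i\right)^2 = \left(\sum_{i=0}^{t-1} r^{i/2} \cdot g_{t-i} \beta_2^{i/2}\right)^2 \le \left(\sum_{i=0}^{t-1} r^i\right) \left(\sum_{i=0}^{t-1} g_{t-i}^2 \beta_2^i\right) = \frac{1 - r^t}{1 - r} \sum_{i=0}^{t-1} g_{t-i}^2 \beta_2^i$$

Taking square roots and dividing by $\sqrt{\sum_{i=0}^{t-1} g_{t-i}^2 \beta_2^i}$ gives the last factor of the bound. Equality requires $g_{t-i} \beta_2^{i/2} \propto r^{i/2}$, i.e., $g_{t-i} \propto (\beta_1 / \beta_2)^i$.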

We can immediately read off two things from this bound.

First, $\beta_1$ and $\beta_2$ are coupled: if $\beta_1^2 > \beta_2$, the last factor of the bound grows like $\left(\beta_1^2 / \beta_2\right)^{t/2}$, so the update size can blow up exponentially (and lead to instabilities) if you do not have an exponentially decaying learning rate schedule.

Second, the bounds at steady state are not the same as claimed in the paper. Also, at steady state, the tightest bound we get is at $\beta_1 = \beta_2$, where it equals $1$ (see Su, 2026 for an interpretation).
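To make the steady-state claim concrete, here is a minimal sketch that takes $t \to \infty$ in the bound above (assuming $\beta_1^2 < \beta_2$, so the geometric sum converges) and scans $\beta_1$ on a grid. The function name and the grid are hypothetical choices for illustration.

```python
import math

def steady_state_bound(beta1, beta2):
    """Limit of the update bound as t -> infinity; requires beta1**2 < beta2."""
    r = beta1 ** 2 / beta2
    if r >= 1.0:
        raise ValueError("the bound diverges when beta1**2 >= beta2")
    return (1 - beta1) / math.sqrt((1 - beta2) * (1 - r))

beta2 = 0.999
# Scan beta1 over 0.001, 0.002, ..., 0.999 to see where the bound is smallest.
grid = [i / 1000 for i in range(1, 1000)]
best_beta1 = min(grid, key=lambda b1: steady_state_bound(b1, beta2))
print("smallest bound at beta1 =", best_beta1, "->", steady_state_bound(best_beta1, beta2))
print("bound at (0.9, 0.999):", steady_state_bound(0.9, 0.999))
```

On this grid the minimum sits at $\beta_1 = \beta_2$, where the bound is $1$ up to rounding, while the common $(0.9, 0.999)$ setting gives a worst-case bound of roughly $7.3$.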

A caveat is that in the equality case, we have gradients $g_t \propto \left(\beta_2 / \beta_1\right)^t$, so in the blow-up regime (where $\beta_1 > \beta_2$) the gradients are exponentially small. As long as we're in that region, these steps are probably fine, unless an update takes us to a region with a more reasonable gradient, which would likely mean a loss spike. This equality case is likely too pathological, but suffices to illustrate the point.
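The equality case can be checked numerically. The sketch below (hypothetical helper names, and an assumed pair $\beta_1 = 0.99$, $\beta_2 = 0.95$ chosen so that $\beta_1^2 > \beta_2$) feeds gradients $g_s = (\beta_2 / \beta_1)^s$ through the Adam recurrences and compares the resulting update with the finite-$t$ bound.

```python
import math

def adam_update(grads, beta1, beta2):
    """Bias-corrected m_hat / sqrt(v_hat) after consuming grads (lr and eps ignored)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    t = len(grads)
    return (m / (1 - beta1 ** t)) / math.sqrt(v / (1 - beta2 ** t))

def update_bound(t, beta1, beta2):
    """Finite-t upper bound on |U_t|; the geometric sum equals t when beta1**2 == beta2."""
    r = beta1 ** 2 / beta2
    geom = float(t) if r == 1.0 else (1 - r ** t) / (1 - r)
    prefactor = math.sqrt((1 - beta2 ** t) / (1 - beta2)) * (1 - beta1) / (1 - beta1 ** t)
    return prefactor * math.sqrt(geom)

def extremal_grads(t, beta1, beta2):
    """Gradients g_s = (beta2 / beta1)**s for s = 1..t: the equality case up to scale."""
    return [(beta2 / beta1) ** s for s in range(1, t + 1)]

beta1, beta2 = 0.99, 0.95  # assumed values with beta1**2 > beta2
for t in (10, 50, 200):
    u = adam_update(extremal_grads(t, beta1, beta2), beta1, beta2)
    print(t, u, update_bound(t, beta1, beta2))
```

In this setup the update matches the bound at every horizon and keeps growing with $t$, even though the gradients themselves shrink geometrically.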

Note that the original paper didn't have this corrected factor, so its discussion of coupling is indirect at best. Its convergence bounds use a decaying $\beta_1$ and assume both $\beta_1^2 < \sqrt{\beta_2}$ and that parameter differences don't grow unboundedly, but I don't think the paper's convergence bounds are very useful.

Choosing $\beta_1$ and $\beta_2$

The earlier discussion theoretically motivated the idea of coupling between $\beta_1$ and $\beta_2$. There are some optimizers, like LaProp, that try to decouple them by tricks such as applying momentum to the preconditioned gradient (i.e., the RMSProp update) instead of preconditioning the momentum, but this optimizer is typically not used.

Training setups typically use $\beta_1 < \beta_2$, with some gap between them, which is a stronger condition than $\beta_1^2 < \beta_2$ (the one under which update sizes don't blow up) when $\beta_2$ is fixed. The latter condition also appears in a number of papers that study Adam's convergence and its online learning properties, e.g., Nguyen (2025).

A heuristic argument is that what the denominator is sensitive to is a decrease in gradient magnitudes: the denominator shrinks, and if the momentum is still stale, this can lead to larger steps (at least in a transient state; for a toy example, consider what happens when the gradient jumps from a steady state of $g$ to $\lambda g$ for the next $k$ steps, for $\lambda < 1$). So the momentum timescale should typically be shorter than the second-moment timescale.
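The toy example can be simulated directly. The following sketch is illustrative only: the function name, the warm-start length, $\lambda = 0.1$, $k = 100$, and the two beta pairs are all assumptions chosen for this example. It runs Adam's statistics on a constant gradient $g$ until they settle, then switches to $\lambda g$ and records the bias-corrected update at each of the next $k$ steps.

```python
import math

def updates_after_drop(beta1, beta2, lam=0.1, warm_steps=2000, k=100, g=1.0):
    """Adam updates m_hat / sqrt(v_hat) for the k steps after the gradient
    drops from g to lam * g (learning rate and eps ignored)."""
    m = v = 0.0
    out = []
    for t in range(1, warm_steps + k + 1):
        grad = g if t <= warm_steps else lam * g
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        if t > warm_steps:
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            out.append(m_hat / math.sqrt(v_hat))
    return out

# Momentum slower than the second moment (stale momentum) vs. the usual ordering.
stale_momentum = updates_after_drop(beta1=0.99, beta2=0.9)
fresh_momentum = updates_after_drop(beta1=0.9, beta2=0.99)
print("max update, beta1 > beta2:", max(stale_momentum))
print("max update, beta1 < beta2:", max(fresh_momentum))
```

With the slow momentum, the denominator forgets the old gradient scale well before the numerator does, so the update transiently exceeds $1$; with the usual ordering it stays below $1$ throughout the window.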

Some papers/blog posts recommend choosing $\beta_1 = \beta_2$ (see Orvieto and Gower, 2025; Su, 2026), but practitioners report loss spikes in some setups, especially at scale, so I tend to avoid this. It is possible that this is optimal and doesn't lead to loss spikes under better training setups.

For the sake of completeness, however, this is not the only loss spike pathology due to Adam betas. In fact, a much more commonly seen Adam-specific one is due to $\beta_2$ being too large, partly because PyTorch's defaults are $(\beta_1, \beta_2) = (0.9, 0.999)$. As long as $\beta_1$ is smaller than $\beta_2$ with some gap, a stale second-moment estimate due to large $\beta_2$ has often been identified as the origin of these spikes (see Shazeer and Stern, 2018; Wortsman et al., 2023; Cohen et al., 2022; Cohen et al., 2025; Bai et al., 2026). The $\beta_2$ effect in the case of $\beta_1 = 0$ (more specifically, full-batch RMSProp) was studied in Central Flows, which provides a semi-theoretical justification for the empirically known fact that too large a $\beta_2$ can lead to loss spikes (Cohen et al., 2025), even in the absence of gradient noise. It is also worth noting that larger $\beta_2$ tends to help in later stages of training in terms of optimization performance. Adafactor uses a $\beta_2$ schedule that gradually increases over time alongside update clipping to avoid instabilities (Shazeer and Stern, 2018). We defer more discussion of other kinds of loss spikes to papers in the references.

Tangentially, this is in line with the philosophical argument that heavy-handed interventions that help stability usually lead to suboptimal solutions. It is also important to note, however, that when loss spikes persist for more than a single step, the damage to the network can be irreversible and usually leads to worse final models, which is why persistent loss spikes should be avoided.

Note that the $\beta$s also interact with other hyperparameters, such as the learning rate. For instance, as far as edge of stability is concerned, the relation between learning rate and preconditioned sharpness changes by constants related to the $\beta$s. Gradient clipping (which is done for stability reasons, but in my opinion is a sign of other issues) also changes the meaning of the momentum and second-moment estimates.

Bias "correction"

The Adam paper does a bias "correction" for both the momentum and the second-moment estimate. The latter is important, but it is valid to question whether it is needed for the former.

The Adafactor paper observes that bias correction can be seen as an effective schedule for the betas: both effective betas start from $0$ and increase monotonically toward the original betas. This shows that earlier data points are given a larger weight than later points, and hence contribute much more to how the model is updated.
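This equivalence is easy to check numerically. In the sketch below (hypothetical function names; the schedule $\hat\beta_t = \beta (1 - \beta^{t-1}) / (1 - \beta^t)$ is derived here by substituting the bias-corrected average into the EMA recurrence), a bias-corrected EMA is compared with a plain EMA that uses the time-varying decay rate.

```python
def effective_beta(beta, t):
    """Decay rate beta_hat_t for which the bias-corrected EMA satisfies
    x_hat_t = beta_hat_t * x_hat_{t-1} + (1 - beta_hat_t) * g_t."""
    return beta * (1 - beta ** (t - 1)) / (1 - beta ** t)

def bias_corrected_ema(values, beta):
    """Standard EMA started at zero, divided by (1 - beta**t) at each step."""
    x = 0.0
    out = []
    for t, g in enumerate(values, start=1):
        x = beta * x + (1 - beta) * g
        out.append(x / (1 - beta ** t))
    return out

def scheduled_ema(values, beta):
    """EMA with the time-varying decay rate and no explicit bias correction."""
    x_hat = 0.0
    out = []
    for t, g in enumerate(values, start=1):
        b = effective_beta(beta, t)
        x_hat = b * x_hat + (1 - b) * g
        out.append(x_hat)
    return out

values = [0.5, -1.0, 2.0, 0.25, 3.0, -0.75]  # arbitrary example sequence
print(bias_corrected_ema(values, 0.9))
print(scheduled_ema(values, 0.9))
print([round(effective_beta(0.9, t), 4) for t in range(1, 8)])
```

The same schedule applies to either beta, since the argument only uses the EMA recurrence; for the second moment, `values` would hold squared gradients.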

So, Adam's bias correction for momentum weighs each step the same, while a more plausible desideratum might be to weigh each data point the same, e.g., by using the contribution of each batch's gradient as a proxy. This suggests not applying the correction for momentum at all, since dropping it reduces the asymmetry between the gradient contributions of a sample at the beginning and a sample seen after the correction term has effectively become $1$, which takes around $\Theta(-1/\log \beta_1)$ steps. A typical warmup may also take into account the longer $\beta_2$ timescale, in order to wait for the second-moment estimate to stabilize (so, e.g., of the order of $\frac{1}{1 - \beta_2}$ steps).
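One rough way to quantify this asymmetry is to add up, over all later steps, the weight that a single gradient $g_s$ receives in the momentum term. The sketch below does this with and without the $1 / (1 - \beta_1^t)$ correction. It is a simplification that ignores the second-moment normalization and the learning rate schedule, and the function name and horizon are hypothetical.

```python
def total_weight(s, beta, horizon, bias_correction):
    """Sum over steps t = s..horizon of the weight that gradient g_s
    receives in the (optionally bias-corrected) momentum at step t."""
    total = 0.0
    for t in range(s, horizon + 1):
        w = (1 - beta) * beta ** (t - s)
        if bias_correction:
            w /= 1 - beta ** t
        total += w
    return total

beta, horizon = 0.9, 2000
for s in (1, 2, 5, 20, 200):
    with_bc = total_weight(s, beta, horizon, bias_correction=True)
    without_bc = total_weight(s, beta, horizon, bias_correction=False)
    print(f"s={s}: with correction {with_bc:.4f}, without {without_bc:.4f}")
```

Without the correction every sample's total weight is essentially $1$; with it, the earliest samples receive noticeably more than $1$ and the excess fades for samples that arrive after the correction term has saturated.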

Note that the above discussion is only important for the first few steps, and is hence potentially relevant to the warmup phase. A more concrete diagnostic that is helpful for understanding the necessity of warmup is related to the speed of "representation change", and is discussed in Kosson, Messmer, and Jaggi (2024). They also find that Adam's $\beta_1$ bias correction is at least partly the reason behind warmup being necessary, among other issues such as small critical batch size near the beginning of training.

On $\epsilon$

Nado (2019) argues that $\epsilon$ is typically considered a nuisance parameter, but it is quite important for Adam, going beyond a mere numerical stability detail. It can, in certain cases, behave like a regularization parameter and can also let us interpolate between SGD and Adam regimes, and large $\epsilon$ values seem to be commonly needed in classical RL. However, there are also other important reasons why it needs to be tuned, and in LLM training/scaling recipes, using a smaller $\epsilon$ than the defaults usually helps.
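The interpolation between the two regimes can be seen in a small numerical sketch. It assumes the common form of the update with $\epsilon$ added outside the square root, $\hat m / (\sqrt{\hat v} + \epsilon)$, constant gradients, and made-up gradient magnitudes; the function name is hypothetical.

```python
import math

def adam_step_constant_grad(g, eps, beta1=0.9, beta2=0.999, steps=10):
    """Update m_hat / (sqrt(v_hat) + eps) after `steps` identical gradients g
    (learning rate omitted; eps is added outside the square root)."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** steps)
    v_hat = v / (1 - beta2 ** steps)
    return m_hat / (math.sqrt(v_hat) + eps)

for eps in (1e-8, 1e-3, 10.0):
    small = adam_step_constant_grad(1e-3, eps)
    large = adam_step_constant_grad(1e-1, eps)
    print(f"eps={eps:g}: g=1e-3 -> {small:.3g}, g=1e-1 -> {large:.3g}, ratio={large / small:.3g}")
```

When $\epsilon$ is much smaller than the gradient scale, both parameters get nearly the same step size regardless of gradient magnitude (the Adam regime). When $\epsilon$ dominates, the step becomes proportional to the gradient, as in SGD with momentum and an effective learning rate scaled by $1/\epsilon$.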

It is also important when it comes to hyperparameter transfer, and pathologies due to $\epsilon$ sometimes only emerge at scale. For example, hyperparameter transfer for the mean-field parameterization breaks and performs poorly when $\epsilon$ is not scaled appropriately to avoid pathologies due to gradient underflow (Everett et al., 2024). There are optimizers such as Adam-atan2 that try to get rid of this parameter completely (Everett et al., 2024).
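To illustrate why an atan2-based update needs no $\epsilon$, here is a minimal sketch. It assumes the update takes the form $\operatorname{atan2}(\hat m, \sqrt{\hat v})$ up to constant factors, which may differ from the published variant in scaling details; the function names and test values are hypothetical.

```python
import math

def adam_ratio(m_hat, v_hat, eps):
    """Standard Adam direction, with eps guarding against division by zero."""
    return m_hat / (math.sqrt(v_hat) + eps)

def atan2_ratio(m_hat, v_hat):
    """eps-free alternative: atan2 is defined at (0, 0) and depends only on
    the ratio of its arguments when the second one is positive."""
    return math.atan2(m_hat, math.sqrt(v_hat))

m_hat, v_hat = 0.3, 0.25
scale = 2.0 ** -60  # mimic gradients that have shrunk by many orders of magnitude

print(adam_ratio(m_hat, v_hat, 1e-8), adam_ratio(m_hat * scale, v_hat * scale ** 2, 1e-8))
print(atan2_ratio(m_hat, v_hat), atan2_ratio(m_hat * scale, v_hat * scale ** 2))
print(atan2_ratio(0.0, 0.0))
```

With a fixed $\epsilon$, the standard ratio collapses toward zero once the gradient scale falls below $\epsilon$, whereas the atan2 form is unchanged under rescaling of the gradients and stays finite when both moments are zero.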

Tuning this parameter can lead to surprising gains in performance. There is evidence of this hyperparameter being tuned for good performance, e.g., the modded-nanogpt speedrun uses $\epsilon = 10^{-10}$ and RWKV7 uses $\epsilon = 10^{-18}$, whereas the PyTorch default is $10^{-8}$.

References and acknowledgements

Acknowledgements: Thanks to Kevin Yin for discussing $\beta_2$-related Adam loss spikes and Adam bias correction in the EleutherAI Discord server.
  • Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980.
  • Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization." ICLR 2019. arXiv:1711.05101.
  • Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." NeurIPS 2021. arXiv:2203.03466.
  • Greg Yang, James B. Simon, Jeremy Bernstein. "A Spectral Condition for Feature Learning." arXiv:2310.17813.
  • Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. "Scaling Exponents Across Parameterizations and Optimizers." ICML 2024. arXiv:2407.05872.
  • Liu Ziyin, Zhikang T. Wang, and Masahito Ueda. "LaProp: Separating Momentum and Adaptivity in Adam." arXiv:2002.04839.
  • Noam Shazeer and Mitchell Stern. "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." ICML 2018. arXiv:1804.04235.
  • Antonio Orvieto and Robert M. Gower. "In Search of Adam's Secret Sauce." NeurIPS 2025. arXiv:2505.21829.
  • Su Jianlin. "Adam优化器的最优超参数是$\beta_1=\beta_2$?" Scientific Spaces / 科学空间, 2026.
  • Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. "Stable and Low-Precision Training for Large-Scale Vision-Language Models." NeurIPS 2023. arXiv:2304.13013.
  • Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, and Justin Gilmer. "Adaptive Gradient Methods at the Edge of Stability." arXiv:2207.14484.
  • Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D. Lee. "Understanding Optimization in Deep Learning with Central Flows." ICLR 2025. arXiv:2410.24206.
  • Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, and Zhi-Qin John Xu. "Adaptive Preconditioners Trigger Loss Spikes in Adam." ICML 2026.
  • Atli Kosson, Bettina Messmer, and Martin Jaggi. "Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training." NeurIPS 2024. arXiv:2410.23922.
  • John St John. "AdamD: Improved Bias-Correction in Adam." arXiv:2110.10828.
  • Zachary Nado. "$\epsilon$, A Nuisance No More." Blog post, 2019.
  • Keller Jordan et al. "modded-nanogpt: Speedrunning the NanoGPT Baseline." GitHub repository.
  • RWKV project.
  • PyTorch documentation. "torch.optim.Adam" and "torch.optim.AdamW."
  • Su Jianlin. "为什么Adam的Update RMS是0.2?" Scientific Spaces / 科学空间, 2025.
  • Su Jianlin. "重新思考学习率与Batch Size(四):EMA." Scientific Spaces / 科学空间, 2025.
  • Su Jianlin. "AdamW的Weight RMS的渐近估计(上)." Scientific Spaces / 科学空间, 2025.
  • Su Jianlin. "AdamW的Weight RMS的渐近估计(下)." Scientific Spaces / 科学空间, 2025.
  • Aaron Defazio. "Why Gradients Rapidly Increase Near the End of Training." arXiv:2506.02285, 2025.
  • Quan Nguyen. "How to Set $\beta_1, \beta_2$ in Adam: An Online Learning Perspective." ALT 2026. arXiv:2510.03478.
  • Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, Susan Zhang. "A Theory on Adam Instability in Large-Scale Machine Learning." arXiv:2304.09871.
  • Jerry Ma and Denis Yarats. "On the Adequacy of Untuned Warmup for Adaptive Optimization." arXiv:1910.04209.
Cite this post
@online{adam-update,
  author    = {nor},
  title     = {Remark #2: The Adam update},
  year      = {2026},
  month     = {05},
  day       = {29},
  url       = {https://nor-blog.pages.dev/posts/2026-05-29-adam-update/},
}