Remark #2: The Adam update

May 29, 2026 · 13 min · 2619 words · nor

Table of Contents

TL;DR
Introduction
Update size bounds and the coupling of $\beta_1$ and $\beta_2$
Choosing $\beta_1$ and $\beta_2$
Bias "correction"
On $\epsilon$
References and acknowledgements

TL;DR

The original Adam paper's bounds on the update size have some math issues: the update size depends on Adam's $\beta_1$ and $\beta_2$ more closely than the paper suggests, and the two are tightly coupled. We also discuss another fix to Adam's formulation that reduces the need for warmup, and briefly discuss loss spikes and $\epsilon$. These are all things that are known to people who work on optimization, but they are not often discussed among the broader community.

Introduction

Adam and AdamW have been the backbone of deep learning for a long time now, even as more advanced optimizers have started to displace them at both small and large scales. There has been a lot of research, and there are many folklore tricks, about how to pick their hyperparameters, how those hyperparameters interact with the optimization process, and where instabilities arise. There is a lot of common knowledge on choosing AdamW's learning rate and weight decay, and on scaling/scheduling them (see, e.g., Loshchilov and Hutter, 2019; Yang et al., 2022; Yang et al., 2023; Everett et al., 2024; Defazio, 2025), but $\beta_1$, $\beta_2$, and $\epsilon$ do not get enough attention. Without going too much into the details, we'll look at a couple of aspects of these parameters, and point to further references. We'll ignore weight decay (so Adam and AdamW are equivalent for our purposes), because that's a whole other topic that deserves its own post/book.

Update size bounds and the coupling of $\beta_1$ and $\beta_2$

If one reads the Adam paper, it might seem like $\beta_1$ and $\beta_2$ are somewhat independent parameters. However, this is not the case, and we can obtain some clues if we look at the derivations a bit more closely.

The first issue is with the bounds on the update in Section 2. The paper says that because $|E[g]| \le \sqrt{E[g^2]}$, the update will be upper-bounded by $1$. However, this is only true when both expectations are taken with respect to the same weighting distribution over the gradient history, which is not the case for $\beta_1 \ne \beta_2$. It also considers specific cases of sparse gradients and uses them as the basis for worst-case bounds, which is incorrect. So, as they are, the paper's bounds on the update size are invalid.

In fact, there is a coupling between $\beta_1$ and $\beta_2$ for Adam. We show one of these instances of coupling below: if $\beta_1^2 \ge \beta_2$ , there exist cases where the update magnitude itself can blow up over time.

We can bound the update by looking at the gradient history of the optimization process. First, the update for a single parameter at step $t$ (ignoring the learning rate and $\epsilon$ ) looks like

$U_t = \sqrt{\frac{1 - \beta_2^t}{1 - \beta_2}} \cdot \frac{1 - \beta_1}{1 - \beta_1^t} \cdot \frac{\sum_{i = 0}^{t-1} g_{t - i} \beta_1^i}{\sqrt{\sum_{i = 0}^{t-1} g_{t - i}^2 \beta_2^i}}$
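As a sanity check on this expression, here is a minimal sketch that computes the same quantity two ways: by running the usual Adam moment recursions with bias correction, and by evaluating the explicit sums. The function names and the gradient values are made up for illustration, and $\epsilon$ and the learning rate are ignored, as in the text.

```python
import math

def adam_ratio_recursive(grads, beta1, beta2):
    """Bias-corrected m_hat / sqrt(v_hat) after consuming g_1..g_t (eps = 0)."""
    m = v = 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
    t = len(grads)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / math.sqrt(v_hat)

def adam_ratio_closed_form(grads, beta1, beta2):
    """Same quantity, written as explicit sums over the gradient history."""
    t = len(grads)
    # grads[t - 1 - i] is g_{t-i}: the gradient from i steps ago.
    num = sum(grads[t - 1 - i] * beta1 ** i for i in range(t))
    den = math.sqrt(sum(grads[t - 1 - i] ** 2 * beta2 ** i for i in range(t)))
    prefactor = (math.sqrt((1.0 - beta2 ** t) / (1.0 - beta2))
                 * (1.0 - beta1) / (1.0 - beta1 ** t))
    return prefactor * num / den

grads = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1]  # arbitrary illustrative values
u_rec = adam_ratio_recursive(grads, 0.9, 0.999)
u_cf = adam_ratio_closed_form(grads, 0.9, 0.999)
print(u_rec, u_cf)
```

The two values should agree up to floating-point error, which is all the closed form claims.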

Now, applying Cauchy-Schwarz to the sequences $\left\langle \left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^i \right\rangle$ and $\langle g_{t-i} \beta_2^{i/2} \rangle$ (whose squares are $\left(\frac{\beta_1^2}{\beta_2}\right)^i$ and $g_{t-i}^2 \beta_2^i$), and noting that we can upper-bound by the case where the gradient values $g_i$ all have the same sign, we get an achievable upper bound on the update as follows (for $\beta_1^2 \ne \beta_2$):

$|U_t| \le \sqrt{\frac{1 - \beta_2^t}{1 - \beta_2}} \cdot \frac{1 - \beta_1}{1 - \beta_1^t} \cdot \sqrt{\frac{1 - \left(\frac{\beta_1^2}{\beta_2}\right)^t}{1 - \left(\frac{\beta_1^2}{\beta_2}\right)}}$
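For readers who want the intermediate step spelled out, here is a sketch of the argument using only the symbols above. Writing $\beta_1^i = \left(\beta_1 / \sqrt{\beta_2}\right)^i \cdot \beta_2^{i/2}$ in each term of the numerator, the triangle inequality followed by Cauchy-Schwarz gives

$\left|\sum_{i = 0}^{t-1} g_{t - i} \beta_1^i\right| \le \sum_{i = 0}^{t-1} \left(\frac{\beta_1}{\sqrt{\beta_2}}\right)^i \cdot |g_{t - i}| \beta_2^{i/2} \le \sqrt{\sum_{i = 0}^{t-1} \left(\frac{\beta_1^2}{\beta_2}\right)^i} \cdot \sqrt{\sum_{i = 0}^{t-1} g_{t - i}^2 \beta_2^i}$

The second square root cancels the denominator of $U_t$, and the first is a geometric sum, equal to $\sqrt{\frac{1 - \left(\beta_1^2 / \beta_2\right)^t}{1 - \beta_1^2 / \beta_2}}$ when $\beta_1^2 \ne \beta_2$ and to $\sqrt{t}$ when $\beta_1^2 = \beta_2$.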

We can immediately read off two things from this bound.

First, $\beta_1$ and $\beta_2$ are coupled: if $\beta_1^2 > \beta_2$, the last factor grows like $\left(\beta_1^2 / \beta_2\right)^{t/2}$, so the update size can blow up exponentially (and lead to instabilities) unless the learning rate schedule decays exponentially to compensate.

Second, the bounds at steady state are not the same as claimed in the paper. Also, at steady state, the tightest bound we get is from $\beta_1 = \beta_2$ (see Su, 2026 for an interpretation).
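To make the steady-state statement concrete, here is a short sketch, assuming $\beta_1^2 < \beta_2$ so that the limit exists. Letting $t \to \infty$ in the bound above, every term of the form $\beta^t$ vanishes and we are left with

$\lim_{t \to \infty} \sqrt{\frac{1 - \beta_2^t}{1 - \beta_2}} \cdot \frac{1 - \beta_1}{1 - \beta_1^t} \cdot \sqrt{\frac{1 - \left(\frac{\beta_1^2}{\beta_2}\right)^t}{1 - \left(\frac{\beta_1^2}{\beta_2}\right)}} = \frac{1 - \beta_1}{\sqrt{(1 - \beta_2)\left(1 - \frac{\beta_1^2}{\beta_2}\right)}}$

For fixed $\beta_1$, the derivative of $(1 - \beta_2)\left(1 - \beta_1^2 / \beta_2\right)$ with respect to $\beta_2$ is $-1 + \beta_1^2 / \beta_2^2$, which vanishes at $\beta_2 = \beta_1$. The product is maximized there and equals $(1 - \beta_1)^2$, so the steady-state bound is exactly $1$ at $\beta_1 = \beta_2$ and larger everywhere else. As an illustration, plugging in $(\beta_1, \beta_2) = (0.9, 0.999)$ gives a bound of roughly $7.3$.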

A caveat is that the equality case requires gradients $g_t \propto \left(\beta_2 / \beta_1\right)^t$, which are exponentially small in the blow-up regime (where $\beta_1 > \beta_2$). As long as we're in that region, these steps are probably fine, unless an update takes us to a region with a more reasonable gradient, which would likely mean a loss spike. This equality case is likely too pathological, but suffices to illustrate the point.
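The equality case can be checked numerically. The sketch below uses a hypothetical pair $(\beta_1, \beta_2) = (0.99, 0.9)$ with $\beta_1^2 > \beta_2$, chosen only to make the growth visible, builds the gradient history $g_s \propto (\beta_2 / \beta_1)^s$, and compares the resulting $U_t$ with the bound. The function names are mine, not from any library.

```python
import math

def update_ratio(grads, beta1, beta2):
    """U_t from the closed-form expression (learning rate and eps ignored)."""
    t = len(grads)
    num = sum(grads[t - 1 - i] * beta1 ** i for i in range(t))
    den = math.sqrt(sum(grads[t - 1 - i] ** 2 * beta2 ** i for i in range(t)))
    pre = math.sqrt((1 - beta2 ** t) / (1 - beta2)) * (1 - beta1) / (1 - beta1 ** t)
    return pre * num / den

def update_bound(t, beta1, beta2):
    """Upper bound on |U_t|, valid for beta1**2 != beta2."""
    q = beta1 ** 2 / beta2
    pre = math.sqrt((1 - beta2 ** t) / (1 - beta2)) * (1 - beta1) / (1 - beta1 ** t)
    return pre * math.sqrt((1 - q ** t) / (1 - q))

beta1, beta2 = 0.99, 0.9  # hypothetical pair with beta1**2 > beta2
results = {}
for t in (10, 50, 200):
    # Equality-case history: g_s proportional to (beta2 / beta1)**s, s = 1..t.
    grads = [(beta2 / beta1) ** s for s in range(1, t + 1)]
    results[t] = (update_ratio(grads, beta1, beta2), update_bound(t, beta1, beta2))
    print(t, results[t])
```

In this setup the update matches the bound at every $t$ and keeps growing with $t$, even though the gradients themselves shrink geometrically.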

Note that the original paper didn't have this corrected factor, so its discussion of coupling is indirect at best. Its convergence bounds use a decaying $\beta_1$ and assume $\beta_1^2 < \sqrt{\beta_2}$, and also assume that parameter differences don't grow unboundedly, but I don't think the paper's convergence bounds are very useful.

Choosing $\beta_1$ and $\beta_2$

The earlier discussion theoretically motivated the idea of coupling between $\beta_1$ and $\beta_2$ . There are some optimizers, like LaProp, that try to decouple them by tricks such as applying momentum to the preconditioned gradient (i.e., the RMSProp update) instead of preconditioning the momentum, but this optimizer is typically not used.

Training setups typically use $\beta_1 < \beta_2$ , with some gap between them, which is a stronger condition than $\beta_1^2 < \beta_2$ - the one where update sizes don't blow up - when $\beta_2$ is fixed. The latter is also seen in a bunch of papers that try to study Adam convergence as well as online learning properties of Adam, e.g. Nguyen (2025).

A heuristic argument goes as follows: the denominator is sensitive to shrinking gradient magnitudes, and in that situation a stale momentum can lead to larger steps (at least transiently; for a toy example, consider what happens when the gradient jumps from a steady state of $g$ to $\lambda g$ for the next $k$ steps, for $\lambda < 1$). So the momentum timescale should typically be shorter than the second-moment timescale.
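The toy example in the parenthetical can be simulated directly. This is a minimal sketch under stated assumptions: the optimizer starts at the steady state for a constant gradient of $1$ (so $m = 1$, $v = 1$ and bias correction is irrelevant), $\epsilon$ is ignored, and $\lambda = 0.1$, $k = 200$, and the two beta pairs are illustrative choices of mine.

```python
import math

def ratio_after_drop(beta1, beta2, lam, k):
    """m / sqrt(v) at each of the k steps after the gradient drops from 1 to lam.

    Starts from the steady state for a constant gradient of 1 (m = 1, v = 1),
    so no bias correction is needed; eps is ignored.
    """
    m, v = 1.0, 1.0
    ratios = [m / math.sqrt(v)]
    for _ in range(k):
        m = beta1 * m + (1 - beta1) * lam
        v = beta2 * v + (1 - beta2) * lam * lam
        ratios.append(m / math.sqrt(v))
    return ratios

lam, k = 0.1, 200  # illustrative values
stale_momentum = ratio_after_drop(0.99, 0.9, lam, k)   # momentum slower than v
usual_ordering = ratio_after_drop(0.9, 0.999, lam, k)  # momentum faster than v
print(max(stale_momentum), max(usual_ordering))
```

With the slower momentum, the ratio overshoots $1$ while the denominator has already adapted to the smaller gradient; with the usual ordering it never exceeds its starting value of $1$ in this toy setting.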

Some papers/blog posts recommend choosing $\beta_1 = \beta_2$ (see Orvieto and Gower, 2025; Su, 2026), but practitioners report loss spikes in some setups, especially at scale, so I tend to avoid this. It is possible that this is optimal and doesn't lead to loss spikes under better training setups.

For the sake of completeness, this is not the only loss spike pathology caused by Adam's betas. In fact, a much more commonly seen Adam-specific one comes from $\beta_2$ being too large, partly because PyTorch's defaults are $(\beta_1, \beta_2) = (0.9, 0.999)$. As long as $\beta_1$ is smaller than $\beta_2$ with some gap, a stale second-moment estimate due to large $\beta_2$ has often been identified as the origin of these spikes (see Shazeer and Stern, 2018; Wortsman et al., 2023; Cohen et al., 2022; Cohen et al., 2025; Bai et al., 2026). The $\beta_2$ effect in the case of $\beta_1 = 0$ (more specifically, full-batch RMSProp) was studied in Central Flows, which provides a semi-theoretical justification for the empirically known fact that too large a $\beta_2$ can lead to loss spikes (Cohen et al., 2025), even in the absence of gradient noise. It is also worth noting that a larger $\beta_2$ tends to help optimization performance in later stages of training. Adafactor uses a $\beta_2$ schedule that gradually increases over time, alongside update clipping, to avoid instabilities (Shazeer and Stern, 2018). We defer more discussion of other kinds of loss spikes to papers in the references.
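One toy picture of a stale second-moment estimate, offered as an illustration rather than as the mechanism from any of the cited papers: suppose the gradient magnitude suddenly grows by a factor of $10$. The momentum catches up quickly, the second moment lags, and the ratio $m / \sqrt{v}$ overshoots. The sketch below assumes a steady-state start, ignores $\epsilon$, fixes $\beta_1 = 0.9$, and uses a jump size and horizon picked for illustration.

```python
import math

def peak_ratio_after_jump(beta1, beta2, scale, k):
    """Largest m / sqrt(v) in the k steps after the gradient jumps from 1 to scale.

    Starts from the steady state for a constant gradient of 1; eps is ignored.
    """
    m, v = 1.0, 1.0
    peak = 1.0
    for _ in range(k):
        m = beta1 * m + (1 - beta1) * scale
        v = beta2 * v + (1 - beta2) * scale * scale
        peak = max(peak, m / math.sqrt(v))
    return peak

# beta1 = 0.9 throughout; the 10x jump and the 5000-step horizon are illustrative.
peaks = {b2: peak_ratio_after_jump(0.9, b2, 10.0, 5000) for b2 in (0.95, 0.99, 0.999)}
print(peaks)
```

In this toy setting the peak overshoot increases with $\beta_2$, which is at least consistent with the intuition that a slower second-moment estimate leaves a longer window of oversized updates.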

Tangentially, this is in line with the philosophical argument that heavy-handed interventions that help stability usually lead to suboptimal solutions. It is also important to note, however, that loss spikes that persist for more than a single step can damage the network irreversibly and usually lead to worse final models, which is why persistent loss spikes should be avoided.

Note that the $\beta$s also interact with other hyperparameters, such as the learning rate. For instance, as far as edge of stability is concerned, the relation between learning rate and preconditioned sharpness changes by constants related to the $\beta$s. When gradient clipping is used (it is done for stability reasons, but in my opinion is a sign of other issues), it also changes the meaning of the momentum and second-moment estimates.

Bias "correction"

The Adam paper does a bias "correction" for both the momentum and the second-moment estimate. The latter is important, but it is valid to question whether it is needed for the former.

The Adafactor paper observes that bias correction can be seen as an effective schedule for each of the betas: the effective value starts from $0$, increases monotonically, and depends on the original beta. This shows that earlier data points are given a larger weight than later points, and hence contribute much more to how the model is updated.
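The equivalence can be checked in a few lines. The sketch below assumes the effective schedule has the form $\hat\beta_t = \beta \cdot \frac{1 - \beta^{t-1}}{1 - \beta^t}$, which is what falls out of substituting $m_t = (1 - \beta^t)\hat m_t$ into the EMA recursion. The function names and the gradient values are illustrative.

```python
def bias_corrected_ema(grads, beta):
    """Standard EMA with Adam-style bias correction; returns the corrected value per step."""
    m, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        out.append(m / (1 - beta ** t))
    return out

def scheduled_ema(grads, beta):
    """Uncorrected EMA whose decay follows beta_t = beta * (1 - beta**(t-1)) / (1 - beta**t)."""
    m_hat, out, schedule = 0.0, [], []
    for t, g in enumerate(grads, start=1):
        beta_t = beta * (1 - beta ** (t - 1)) / (1 - beta ** t)
        m_hat = beta_t * m_hat + (1 - beta_t) * g
        out.append(m_hat)
        schedule.append(beta_t)
    return out, schedule

grads = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1]  # arbitrary illustrative values
corrected = bias_corrected_ema(grads, 0.9)
scheduled, schedule = scheduled_ema(grads, 0.9)
print(schedule)
```

The two sequences coincide, and the printed schedule starts at $0$ and climbs toward $\beta$, so the very first step uses the raw gradient with no averaging at all.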

So, Adam's bias correction for momentum weighs each step the same, while a more plausible desideratum might be to weigh each data point the same, e.g., by using the contribution of each batch's gradient as a proxy. This suggests not applying the correction for momentum at all, since it reduces the asymmetry between gradient contributions of a sample in the beginning and a sample after the correction term effectively becomes $1$ , which is around $\Theta(-1/\log \beta_1)$ steps. A typical warmup may also take into account the longer $\beta_2$ timescale, in order to wait for the second-moment estimate to stabilize (so, e.g., of the order of $\frac{1}{1 - \beta_2}$ steps).
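One way to see what dropping the momentum correction does is to feed both variants a constant gradient. This is a minimal sketch, assuming the second moment is still bias-corrected and ignoring $\epsilon$ and the learning rate; the `correct_momentum` flag is a hypothetical switch of mine, not an option of any library optimizer.

```python
import math

def update_sizes(beta1, beta2, steps, correct_momentum):
    """|update| per step for a constant gradient of 1 (lr and eps ignored).

    The second moment is always bias-corrected; the momentum only if requested.
    """
    m = v = 0.0
    sizes = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * 1.0
        v = beta2 * v + (1 - beta2) * 1.0
        m_used = m / (1 - beta1 ** t) if correct_momentum else m
        sizes.append(abs(m_used) / math.sqrt(v / (1 - beta2 ** t)))
    return sizes

with_correction = update_sizes(0.9, 0.999, 50, correct_momentum=True)
without_correction = update_sizes(0.9, 0.999, 50, correct_momentum=False)

# Steps until the momentum correction factor 1 - beta1**t exceeds 0.99.
steps_to_99 = math.ceil(math.log(0.01) / math.log(0.9))
print(without_correction[0], without_correction[-1], steps_to_99)
```

With the correction, every step has size $1$ from the very first batch. Without it, the size follows $1 - \beta_1^t$, which behaves like a short built-in warmup over the $\Theta(-1/\log \beta_1)$ timescale mentioned above.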

Note that the above discussion is only important for the first few steps, and is hence potentially relevant to the warmup phase. A more concrete diagnostic that is helpful for understanding the necessity of warmup is related to the speed of "representation change", and is discussed in Kosson, Messmer, and Jaggi (2024). They also find that Adam's $\beta_1$ bias correction is at least partly the reason behind warmup being necessary, among other issues such as small critical batch size near the beginning of training.

On $\epsilon$

Nado (2019) argues that $\epsilon$ is typically considered a nuisance parameter, but it is quite important for Adam, going beyond a mere numerical stability detail. It can, in certain cases, behave like a regularization parameter and can also let us interpolate between SGD and Adam regimes, and large $\epsilon$ values seem to be commonly needed in classical RL. However, there are also other important reasons why it needs to be tuned, and in LLM training/scaling recipes, using a smaller $\epsilon$ than the defaults usually helps.

It is also important when it comes to hyperparameter transfer, and pathologies due to $\epsilon$ sometimes only emerge at scale. For example, hyperparameter transfer for the mean-field parameterization breaks, and gets poor performance, due to $\epsilon$ not being scaled appropriately in order to avoid pathologies due to gradient underflow (Everett et al., 2024). There are optimizers such as Adam-atan2 that try to get rid of this parameter completely (Everett et al., 2024).

Tuning this parameter can lead to surprising gains in performance. There is evidence of this hyperparameter being tuned for good performance, e.g. with the modded-nanogpt speedrun using $\epsilon = 10^{-10}$ and RWKV7 using $\epsilon = 10^{-18}$ , with the PyTorch default being $10^{-8}$ .
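To see why these values are not interchangeable, consider the simplest possible case: a constant gradient $g$, for which $m = g$ and $\sqrt{v} = |g|$ at steady state, so the update direction is $g / (|g| + \epsilon)$. The sketch below evaluates this for the three $\epsilon$ values above; the gradient magnitudes are hypothetical and chosen only for illustration.

```python
def steady_state_update(g, eps):
    """Adam update direction for a constant gradient g: m = g, sqrt(v) = |g|."""
    return g / (abs(g) + eps)

g_small = 1e-9  # hypothetical gradient magnitude, chosen for illustration
for eps in (1e-8, 1e-10, 1e-18):
    print(eps, steady_state_update(g_small, eps))

# With |g| much smaller than eps, the update is approximately g / eps (SGD-like).
sgd_like = steady_state_update(1e-12, 1e-8)
# With |g| much larger than eps, it is approximately sign(g) (Adam-like).
adam_like = steady_state_update(1e-3, 1e-8)
print(sgd_like, adam_like)
```

When gradients are small relative to $\epsilon$, the update shrinks in proportion to the gradient, which is the SGD-like regime; a smaller $\epsilon$ keeps such parameters in the sign-like Adam regime.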

References and acknowledgements

Acknowledgements

Thanks to Kevin Yin for discussing $\beta_2$-related Adam loss spikes and Adam bias correction in the EleutherAI discord server.

Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980.
Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization." ICLR 2019. arXiv:1711.05101.
Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." NeurIPS 2021. arXiv:2203.03466.
Greg Yang, James B. Simon, Jeremy Bernstein. "A Spectral Condition for Feature Learning." arXiv:2310.17813.
Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. "Scaling Exponents Across Parameterizations and Optimizers." ICML 2024. arXiv:2407.05872.
Liu Ziyin, Zhikang T. Wang, and Masahito Ueda. "LaProp: Separating Momentum and Adaptivity in Adam." arXiv:2002.04839.
Noam Shazeer and Mitchell Stern. "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost." ICML 2018. arXiv:1804.04235.
Antonio Orvieto and Robert M. Gower. "In Search of Adam's Secret Sauce." NeurIPS 2025. arXiv:2505.21829.
Su Jianlin. "Adam优化器的最优超参数是 $\beta_1=\beta_2$ ？" Scientific Spaces / 科学空间, 2026.
Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, and Ludwig Schmidt. "Stable and Low-Precision Training for Large-Scale Vision-Language Models." NeurIPS 2023. arXiv:2304.13013.
Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, and Justin Gilmer. "Adaptive Gradient Methods at the Edge of Stability." arXiv:2207.14484.
Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, and Jason D. Lee. "Understanding Optimization in Deep Learning with Central Flows." ICLR 2025. arXiv:2410.24206.
Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, and Zhi-Qin John Xu. "Adaptive Preconditioners Trigger Loss Spikes in Adam." ICML 2026.
Atli Kosson, Bettina Messmer, and Martin Jaggi. "Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training." NeurIPS 2024. arXiv:2410.23922.
John St John. "AdamD: Improved Bias-Correction in Adam." arXiv:2110.10828.
Zachary Nado. " $\epsilon$ , A Nuisance No More." Blog post, 2019.
Keller Jordan et al. "modded-nanogpt: Speedrunning the NanoGPT Baseline." GitHub repository.
RWKV project.
PyTorch documentation. "torch.optim.Adam" and "torch.optim.AdamW."
Su Jianlin. "为什么Adam的Update RMS是0.2？" Scientific Spaces / 科学空间, 2025.
Su Jianlin. "重新思考学习率与Batch Size（四）：EMA." Scientific Spaces / 科学空间, 2025.
Su Jianlin. "AdamW的Weight RMS的渐近估计（上）." Scientific Spaces / 科学空间, 2025.
Su Jianlin. "AdamW的Weight RMS的渐近估计（下）." Scientific Spaces / 科学空间, 2025.
Aaron Defazio. "Why Gradients Rapidly Increase Near the End of Training." arXiv:2506.02285, 2025.
Quan Nguyen. "How to Set $\beta_1, \beta_2$ in Adam: An Online Learning Perspective." ALT 2026. arXiv:2510.03478.
Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, Binh Tang, Diana Liskovich, Puxin Xu, Yuchen Zhang, Melanie Kambadur, Stephen Roller, Susan Zhang. "A Theory on Adam Instability in Large-Scale Machine Learning." arXiv:2304.09871.
Jerry Ma and Denis Yarats. "On the Adequacy of Untuned Warmup for Adaptive Optimization." arXiv:1910.04209.

Cite this post

@online{adam-update,
  author    = {nor},
  title     = {Remark #2: The Adam update},
  year      = {2026},
  month     = {05},
  day       = {29},
  url       = {https://nor-blog.pages.dev/posts/2026-05-29-adam-update/},
}