Adaptive Preconditioners Trigger Loss Spikes in Adam

📅 2025-06-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies an intrinsic mechanism underlying loss spikes in Adam: when the squared gradient is significantly smaller than its exponential moving average (governed by β₂), the adaptive preconditioner causes the largest eigenvalue of the preconditioned Hessian to persistently exceed the critical threshold, aligning gradients with principal curvature directions and triggering spikes. Contrary to prior attributions to loss landscape sharpness, we establish the first precise correspondence between a gradient–curvature critical condition (>2/η) and spike occurrence. Using Hessian spectral estimation, second-moment dynamics modeling, and cross-architecture experiments (MLP, CNN, Transformer), we empirically validate the causal link between preconditioner lag and eigenvalue violation. Our findings provide a novel theoretical framework for diagnosing and stabilizing adaptive optimization, along with actionable design principles for improved robustness.
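To make the β₂-exponential decay mentioned above concrete, here is a minimal sketch (not the authors' code; β₂, the initial second moment, and the new gradient scale are illustrative values) of how Adam's second-moment estimate lags behind a sudden drop in squared gradients:

```python
# Minimal sketch: how Adam's second moment v_t lags behind a sudden drop in
# squared gradients, decaying only like beta2**k toward the new gradient scale.
import numpy as np

beta2 = 0.999
v = 1.0                      # second-moment estimate built up from large past gradients
g_sq_new = 1e-4              # current squared gradient, much smaller than v

for k in range(1, 3001):
    v = beta2 * v + (1 - beta2) * g_sq_new   # Adam's EMA update for the second moment
    if k in (1, 100, 1000, 3000):
        # v stays dominated by the decaying beta2**k * v_0 term for a long time,
        # so the preconditioner 1/sqrt(v) responds sluggishly to the new gradient scale.
        print(f"step {k:5d}: v ≈ {v:.3e}, beta2**k * v0 ≈ {beta2**k:.3e}")
```

During this lag the preconditioner keeps growing as v decays, which is the regime the summary links to the eigenvalue violation.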

📝 Abstract
Loss spikes emerge commonly during training across neural networks of varying architectures and scales when using the Adam optimizer. In this work, we investigate the underlying mechanism responsible for Adam spikes. While previous explanations attribute these phenomena to the lower-loss-as-sharper characteristics of the loss landscape, our analysis reveals that Adam's adaptive preconditioners themselves can trigger spikes. Specifically, we identify a critical regime where squared gradients become substantially smaller than the second-order moment estimates, causing the latter to undergo a $\beta_2$-exponential decay and to respond sluggishly to current gradient information. This mechanism can push the maximum eigenvalue of the preconditioned Hessian beyond the classical stability threshold $2/\eta$ for a sustained period, inducing instability. This instability further leads to an alignment between the gradient and the maximum eigendirection, and a loss spike occurs precisely when the gradient-directional curvature exceeds $2/\eta$. We verify this mechanism through extensive experiments on fully connected networks, convolutional networks, and Transformer architectures.
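The stability condition in the abstract can be checked directly on a toy problem. Below is a minimal sketch (a hand-picked diagonal quadratic loss and hyperparameters chosen for illustration; none of this comes from the paper's experiments) that runs Adam-style updates and monitors the largest eigenvalue of the preconditioned Hessian against the 2/η threshold:

```python
# Minimal sketch: track lambda_max of the Adam-preconditioned Hessian on a toy
# quadratic and compare it with the classical stability threshold 2/eta.
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([50.0, 1.0, 0.1])            # Hessian of the quadratic loss L(x) = 0.5 * x^T H x
x = rng.normal(size=3)
eta, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
m = np.zeros(3)
v = np.zeros(3)

for t in range(1, 2001):
    g = H @ x                             # gradient of the quadratic loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    precond = 1.0 / (np.sqrt(v) + eps)    # diagonal Adam preconditioner (bias correction omitted)
    x = x - eta * precond * m

    # Symmetrized preconditioned Hessian P^{-1/2} H P^{-1/2}; when its top
    # eigenvalue exceeds 2/eta, the update enters the unstable regime the paper studies.
    P_half = np.diag(np.sqrt(precond))
    lam_max = np.linalg.eigvalsh(P_half @ H @ P_half).max()
    if t % 500 == 0:
        print(f"step {t}: lambda_max ≈ {lam_max:.1f}   vs   2/eta = {2.0 / eta:.1f}")
```

In this kind of sketch, as the gradients shrink the second moment decays only slowly, so the preconditioned curvature can hover near or above 2/η for a sustained stretch, which is the situation the paper associates with spike onset.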
Problem

Research questions and friction points this paper is trying to address.

Adam optimizer triggers loss spikes in neural networks
Adaptive preconditioners cause instability in training dynamics
Gradient alignment with the maximum-curvature direction leads to loss spikes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adam's adaptive preconditioners trigger loss spikes
Squared gradients fall well below the second-moment estimate, which decays only β₂-exponentially
Maximum eigenvalue of the preconditioned Hessian exceeds the 2/η stability threshold
🔎 Similar Papers
No similar papers found.
👥 Authors
Zhiwei Bai
Shanghai Jiao Tong University
Machine Learning; Deep Learning
Zhangchen Zhou
Institute of Natural Sciences, School of Mathematical Sciences, Shanghai Jiao Tong University
Jiajie Zhao
Institute of Natural Sciences, School of Mathematical Sciences, Shanghai Jiao Tong University
Xiaolong Li
Institute of Natural Sciences, School of Mathematical Sciences, Shanghai Jiao Tong University
Zhiyu Li
Tianjin University
Robust control; attitude control
Feiyu Xiong
MemTensor (Shanghai) Technology Co., Ltd.
Machine Learning; NLP; LLM
Hongkang Yang
MemTensor (Shanghai) Technology Co., Ltd.
Yaoyu Zhang
Shanghai Jiao Tong University
Deep Learning Theory
Zhi-Qin John Xu
Institute of Natural Sciences, School of Mathematical Sciences, Shanghai Jiao Tong University; MOE-LSC, School of Artificial Intelligence, Shanghai Jiao Tong University; Center for LLM, Institute for Advanced Algorithms Research, Shanghai