AI Summary
This work investigates the emergence of large-scale spikes during stochastic gradient descent (SGD) training of neural networks and their underlying causes. Within the neural tangent kernel (NTK) framework, the study focuses on shallow, fully connected networks and establishes, for the first time, a quantitative connection between spike behavior and large deviation theory. The authors introduce a computable criterion function \( G \) that characterizes how the learning rate, data distribution, and kernel structure jointly influence spike formation. Theoretical analysis reveals that spikes occur with high probability when \( G > 0 \), whereas when \( G < 0 \) their probability decays polynomially, at rate \( (n/\eta)^{-\vartheta/2} \). This result provides a rigorous theoretical foundation for the spikes observed in finite-width neural networks at practical widths.
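In symbols, and only as a schematic reading of the statement above (the exact spike event, the definition of \( G \), and all constants are as characterised in the paper; reading \( n \) as the network width), the dichotomy takes the form

\[
\mathbb{P}(\text{spike}) \;\ge\; 1 - o(1) \quad \text{if } G > 0,
\qquad
\mathbb{P}(\text{spike}) \;\le\; C \,(n/\eta)^{-\vartheta/2} \quad \text{if } G < 0,
\]

where \( \eta \) is the learning rate, \( \vartheta \in (0,\infty) \) is the explicitly characterised exponent, and \( C > 0 \) is a placeholder constant.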
Abstract
We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function $G$, depending only on the kernel, the learning rate $\eta$, and the data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/\eta)^{-\vartheta/2}$ for an explicitly characterised exponent $\vartheta\in (0,\infty)$. This yields a concrete, parameter-dependent explanation for why such spikes may still be observed at practical widths.
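As a purely illustrative companion, the minimal sketch below (assuming a toy Gaussian dataset, ReLU activation, and single-sample SGD, none of which come from the paper) trains a shallow network in the NTK parametrisation and monitors the empirical kernel at a probe point; jumps in this trace are the kind of kernel spikes the abstract describes. The paper's criterion $G$ and exponent $\vartheta$ are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all choices are illustrative assumptions, not the paper's):
m = 512        # hidden width ("n" in the abstract's notation, presumably)
d = 5          # input dimension
N = 32         # number of training points
eta = 8.0      # learning rate; small values stay lazy, large ones can spike

X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.standard_normal(N)

# NTK parametrisation of a shallow ReLU net: f(x) = a . relu(W x) / sqrt(m)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def forward(x):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def ntk_diag(x):
    """Empirical NTK value K(x, x) = ||grad_theta f(x)||^2."""
    pre = W @ x
    h = np.maximum(pre, 0.0)
    active = (pre > 0.0).astype(float)
    # a-gradient contribution + W-gradient contribution
    return (h @ h + (x @ x) * np.sum((a * active) ** 2)) / m

probe = X[0]
trace = []
for step in range(2000):
    i = rng.integers(N)                # single-sample SGD
    x, target = X[i], y[i]
    err = forward(x) - target          # residual for the squared loss
    pre = W @ x
    h = np.maximum(pre, 0.0)
    active = (pre > 0.0).astype(float)
    grad_a = err * h / np.sqrt(m)
    grad_W = err * np.outer(a * active, x) / np.sqrt(m)
    a -= eta * grad_a                  # simultaneous update of both layers
    W -= eta * grad_W
    trace.append(ntk_diag(probe))

trace = np.array(trace)
spikes = np.flatnonzero(trace > 3.0 * np.median(trace))   # crude detector
print(f"K(x0, x0): start {trace[0]:.3f}, end {trace[-1]:.3f}, "
      f"{spikes.size} steps above 3x the median")
```

Sweeping `eta` upward from small values should show the kernel trace staying nearly flat in the lazy regime and developing intermittent spikes at larger step sizes, which is the qualitative transition the abstract's criterion $G$ is meant to delineate.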