Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

๐Ÿ“… 2026-03-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work investigates the emergence of large spikes during stochastic gradient descent (SGD) training of neural networks and their underlying causes. Within the neural tangent kernel (NTK) scaling, the study focuses on shallow fully connected networks and establishes a quantitative connection between spike behavior and large-deviations theory. The authors introduce a computable criterion function \( G \) that captures how the learning rate, data distribution, and kernel structure jointly govern spike formation. The analysis shows that spikes occur with high probability when \( G > 0 \), whereas when \( G < 0 \) their probability decays like \( (n/\eta)^{-\vartheta/2} \) for an explicitly characterized exponent \( \vartheta \). This gives a parameter-dependent explanation of why such spikes can still be observed in finite-width networks at practical widths.
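Schematically, and using the notation of the abstract below (the explicit forms of \( G \) and \( \vartheta \) are given in the paper and not reproduced here), the dichotomy reads
\[
\mathbb{P}\bigl(\text{large NTK-flattening spike}\bigr) \;\approx\;
\begin{cases}
1-o(1), & G>0,\\
O\!\bigl((n/\eta)^{-\vartheta/2}\bigr), & G<0,
\end{cases}
\qquad \vartheta\in(0,\infty),
\]
where \( G \) depends only on the kernel, the learning rate \( \eta \), and the data.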

๐Ÿ“ Abstract
We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: When an explicit function $G$, depending only on the kernel, learning rate $\eta$ and data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/\eta)^{-\vartheta/2}$, for an explicitly characterised $\vartheta\in (0,\infty)$. This yields a concrete parameter-dependent explanation for why such spikes may still be observed at practical widths.
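As an illustrative companion to the abstract (not the paper's construction), the sketch below trains a shallow ReLU network in NTK parameterisation with per-sample SGD at a deliberately large learning rate, so that loss spikes and a flattening of a crude NTK-trace proxy become visible. The width m, step size eta, synthetic data, and the proxy itself are assumptions chosen for illustration, not quantities from the paper.

```python
# Minimal sketch (assumptions, not the paper's setup): shallow ReLU network
# in NTK parameterisation, per-sample SGD with a large step size.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 64, 10, 512                      # samples, input dim, hidden width (assumed)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

# NTK parameterisation: output scaled by 1/sqrt(m), weights O(1) at init.
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)        # second layer kept fixed

def forward(x):
    pre = W @ x                            # pre-activations, shape (m,)
    return a @ np.maximum(pre, 0.0) / np.sqrt(m), pre

def ntk_trace():
    # Crude proxy for kernel size: mean diagonal of the empirical NTK
    # restricted to first-layer gradients.
    pre = X @ W.T                          # (n, m)
    feat = (pre > 0).astype(float) * a / np.sqrt(m)
    return float(np.mean(np.sum(feat ** 2, axis=1) * np.sum(X ** 2, axis=1)))

eta = 5.0                                  # deliberately above the rough 2/K stability level;
                                           # depending on the seed the run may spike or diverge
losses, traces = [], []
for step in range(2000):
    i = rng.integers(n)
    pred, pre = forward(X[i])
    err = pred - y[i]
    # gradient of 0.5 * err^2 w.r.t. the first layer only
    grad_W = np.outer(a * (pre > 0) * err / np.sqrt(m), X[i])
    W -= eta * grad_W
    losses.append(0.5 * err ** 2)
    traces.append(ntk_trace())

print("largest loss spike:", max(losses))
print("NTK-trace proxy, start vs end:", traces[0], traces[-1])
```

Run as a plain script; a spike typically shows up as a transient blow-up of the per-sample loss accompanied by a drop in the NTK-trace proxy, which is the qualitative "catapult" picture the abstract refers to.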
Problem

Research questions and friction points this paper is trying to address.

stochastic gradient descent
large deviations
NTK scaling
spike phenomenon
neural network training
Innovation

Methods, ideas, or system contributions that make the work stand out.

large deviations
stochastic gradient descent
neural tangent kernel
catapult phase
spike dynamics