AI Summary
This work investigates the emergence of large-scale spikes during stochastic gradient descent (SGD) training of neural networks and their underlying causes. Within the neural tangent kernel (NTK) framework, the study focuses on shallow, fully connected networks and establishes, for the first time, a quantitative connection between spike behavior and large deviation theory. The authors introduce a computable criterion function \( G \) that characterizes how the learning rate, data distribution, and kernel structure jointly influence spike formation. Theoretical analysis reveals that spikes occur with high probability when \( G > 0 \), whereas when \( G < 0 \) their probability decays polynomially, at rate \( (n/\eta)^{-\vartheta/2} \). This result provides a rigorous theoretical foundation for the spikes observed in finite-width neural networks at practical widths.
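In symbols, and only as a schematic reading of the statement above (the exact spike event, the definition of \( G \), and all constants are as characterised in the paper; reading \( n \) as the network width), the dichotomy takes the form

\[
\mathbb{P}(\text{spike}) \;\ge\; 1 - o(1) \quad \text{if } G > 0,
\qquad
\mathbb{P}(\text{spike}) \;\le\; C \,(n/\eta)^{-\vartheta/2} \quad \text{if } G < 0,
\]

where \( \eta \) is the learning rate, \( \vartheta \in (0,\infty) \) is the explicitly characterised exponent, and \( C > 0 \) is a placeholder constant.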
Abstract
We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: when an explicit function $G$, depending only on the kernel, the learning rate $\eta$, and the data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/\eta)^{-\vartheta/2}$ for an explicitly characterised exponent $\vartheta\in (0,\infty)$. This yields a concrete, parameter-dependent explanation for why such spikes may still be observed at practical widths.
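As a purely illustrative companion, the minimal sketch below (assuming a toy Gaussian dataset, ReLU activation, and single-sample SGD, none of which come from the paper) trains a shallow network in the NTK parametrisation and monitors the empirical kernel at a probe point; jumps in this trace are the kind of kernel spikes the abstract describes. The paper's criterion $G$ and exponent $\vartheta$ are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all choices are illustrative assumptions, not the paper's):
m = 512        # hidden width ("n" in the abstract's notation, presumably)
d = 5          # input dimension
N = 32         # number of training points
eta = 8.0      # learning rate; small values stay lazy, large ones can spike

X = rng.standard_normal((N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.standard_normal(N)

# NTK parametrisation of a shallow ReLU net: f(x) = a . relu(W x) / sqrt(m)
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)

def forward(x):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def ntk_diag(x):
    """Empirical NTK value K(x, x) = ||grad_theta f(x)||^2."""
    pre = W @ x
    h = np.maximum(pre, 0.0)
    active = (pre > 0.0).astype(float)
    # a-gradient contribution + W-gradient contribution
    return (h @ h + (x @ x) * np.sum((a * active) ** 2)) / m

probe = X[0]
trace = []
for step in range(2000):
    i = rng.integers(N)                # single-sample SGD
    x, target = X[i], y[i]
    err = forward(x) - target          # residual for the squared loss
    pre = W @ x
    h = np.maximum(pre, 0.0)
    active = (pre > 0.0).astype(float)
    grad_a = err * h / np.sqrt(m)
    grad_W = err * np.outer(a * active, x) / np.sqrt(m)
    a -= eta * grad_a                  # simultaneous update of both layers
    W -= eta * grad_W
    trace.append(ntk_diag(probe))

trace = np.array(trace)
spikes = np.flatnonzero(trace > 3.0 * np.median(trace))   # crude detector
print(f"K(x0, x0): start {trace[0]:.3f}, end {trace[-1]:.3f}, "
      f"{spikes.size} steps above 3x the median")
```

Sweeping `eta` upward from small values should show the kernel trace staying nearly flat in the lazy regime and developing intermittent spikes at larger step sizes, which is the qualitative transition the abstract's criterion $G$ is meant to delineate.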