🤖 AI Summary
This work addresses training instability, manifested as loss spikes, caused by heavy-tailed stochastic gradient noise that arises from rare expressions in language data and from the composition of multiple layers. The authors observe that real gradient noise resembles entry-wise heavy-tailed contamination and, through a first-order perturbation analysis, show that such noise is localized, so a simple entry-wise clipping method can control the gradient spectrum while remaining computationally cheap. They derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior and establish an $O(\varepsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. In NanoGPT pretraining, smooth shrinkage saves roughly 7% of training tokens relative to Adam, and applying entry-wise clipping before spectral normalization saves roughly a further 2% on top of the Muon optimizer.
📝 Abstract
Training instabilities such as loss spikes are frequently caused by stochastic gradient noise. Because of rare expressions in language training data and the composition of multiple layers, the impact of this noise is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an $O(\varepsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving ${\sim}7\%$ of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a ${\sim}2\%$ token saving on top of Muon.
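To make the idea of entry-wise spectral control concrete, the following is a minimal sketch, not the paper's estimator. The abstract does not state the shrinkage function, so `smooth_shrink` uses a hypothetical stand-in, $\tau \tanh(g/\tau)$, which leaves small entries nearly unchanged and saturates large ones at $\pm\tau$. The names `smooth_shrink`, `shrink_matrix`, `spectral_norm`, and `cauchy_sample`, the threshold `tau`, and the contamination rate `p` are all assumptions introduced for illustration.

```python
import math
import random


def smooth_shrink(g, tau):
    """Hypothetical smooth shrinkage: ~g for |g| << tau, saturates at +/-tau."""
    return tau * math.tanh(g / tau)


def shrink_matrix(G, tau):
    """Apply the shrinkage independently to every entry of a matrix (list of rows)."""
    return [[smooth_shrink(g, tau) for g in row] for row in G]


def spectral_norm(G, iters=200):
    """Estimate the largest singular value by power iteration on G^T G."""
    rows, cols = len(G), len(G[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(G[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = G v
        w = [sum(G[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = G^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    u = [sum(G[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


def cauchy_sample(scale=1.0):
    """Draw from a Cauchy distribution via the inverse-CDF transform."""
    return scale * math.tan(math.pi * (random.random() - 0.5))


if __name__ == "__main__":
    random.seed(0)
    rows, cols, tau, p = 8, 8, 1.0, 0.05  # illustrative values only
    # Small Gaussian "signal" gradient with sparse Cauchy contamination.
    signal = [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    noisy = [
        [s + (cauchy_sample(10.0) if random.random() < p else 0.0) for s in row]
        for row in signal
    ]
    print("spectral norm, contaminated:   ", spectral_norm(noisy))
    print("spectral norm, after shrinkage:", spectral_norm(shrink_matrix(noisy, tau)))
```

Because every shrunk entry has magnitude at most `tau`, the spectral norm of an $m \times n$ shrunk matrix is bounded by its Frobenius norm, which is at most $\tau\sqrt{mn}$, regardless of how large a contaminated entry was. This is only the crude worst-case bound; the localization argument in the paper is what justifies tighter spectral control.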