🤖 AI Summary
This work addresses the lack of theoretical convergence guarantees for stochastically preconditioned SGD methods, such as Adam and RMSProp, under heavy-tailed noise with only finite $p$-th order moments, and clarifies the unresolved performance gap between normalization and gradient clipping. The study develops a worst-case complexity analysis framework and establishes that normalization is theoretically superior to clipping. Leveraging a novel vector-valued Burkholder-type inequality to handle the statistical dependence between the preconditioner and the gradients, it proves that normalization achieves optimal convergence rates of $O(T^{-(p-1)/(3p-2)})$ when problem parameters are known and $O(T^{-(p-1)/(2p)})$ when they are unknown, whereas gradient clipping may fail to converge in the worst case.
📝 Abstract
We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1,2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing SGD training under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when they are unknown, matching the corresponding optimal rates for normalized SGD. In contrast, we prove that clipping may fail to converge in the worst case due to the statistical dependence between the stochastic preconditioner and the gradient estimates. To enable the analysis, we develop a novel vector-valued Burkholder-type inequality that may be of independent interest. These results provide a theoretical explanation for the empirical preference for normalization over clipping in large-scale model training.
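To make the contrast concrete, below is a minimal Python sketch of a single preconditioned SGD step with either normalization or clipping applied to the preconditioned stochastic gradient. The names (`spsgd_step`, `precond_diag`, `clip_tau`) and the diagonal-preconditioner form are illustrative assumptions, not the paper's exact algorithm; the sketch only shows where the two stabilization mechanisms differ.

```python
import numpy as np

def spsgd_step(x, grad, precond_diag, lr, mode="normalize", clip_tau=1.0):
    """One illustrative stochastically preconditioned SGD step.

    x            : current iterate (np.ndarray)
    grad         : stochastic gradient estimate at x
    precond_diag : diagonal preconditioner, e.g. an RMSProp/Adam-style
                   second-moment estimate (elementwise positive)
    mode         : "normalize" rescales the preconditioned step to unit norm;
                   "clip" caps its norm at clip_tau and leaves small steps as-is
    """
    d = grad / precond_diag                      # preconditioned stochastic gradient
    d_norm = max(np.linalg.norm(d), 1e-12)       # avoid division by zero
    if mode == "normalize":
        step = d / d_norm                        # always unit-length update direction
    else:
        step = d * min(1.0, clip_tau / d_norm)   # clipped update, unchanged if small
    return x - lr * step

# Example usage with a toy 3-dimensional iterate.
x = np.array([1.0, -2.0, 0.5])
g = np.array([10.0, 0.1, -3.0])                  # possibly heavy-tailed gradient sample
v = np.array([4.0, 1.0, 2.0])                    # diagonal preconditioner
x_norm = spsgd_step(x, g, v, lr=0.01, mode="normalize")
x_clip = spsgd_step(x, g, v, lr=0.01, mode="clip", clip_tau=1.0)
```

In this toy form, the key difference is that normalization makes the step length independent of the (possibly heavy-tailed) gradient magnitude, while clipping only intervenes when the preconditioned step exceeds the threshold; the paper's separation result concerns how these choices interact with a preconditioner that is itself statistically dependent on the gradients.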