🤖 AI Summary
Existing theoretical frameworks struggle to characterize the generalization performance of stochastic optimization algorithms under heavy-tailed gradient noise. This work proposes a unified analytical framework that, for the first time, systematically analyzes the generalization error of clipped and normalized SGD—including their mini-batch and momentum variants—under the weak assumption that the gradient noise possesses only a bounded centered $p$-th moment with $p \in (1,2]$. By integrating truncation techniques with algorithmic stability theory, the study establishes novel stability bounds and corresponding generalization error upper bounds. These results fill a critical theoretical gap in understanding the generalization behavior of mainstream stochastic optimization methods in the presence of heavy-tailed noise.
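The heavy-tailed noise condition described above can be written out explicitly. The following is a sketch of the standard form of such an assumption (the exact constants and notation in the paper may differ): for a stochastic gradient $g(w)$ of the population objective $F$,

```latex
\mathbb{E}\left[\,\|g(w) - \nabla F(w)\|^{p}\,\right] \le \sigma^{p},
\qquad p \in (1, 2].
```

For $p = 2$ this recovers the classical bounded-variance setting; for $p < 2$ the noise may have infinite variance, which is the heavy-tailed regime targeted here.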
📝 Abstract
Empirical evidence indicates that stochastic optimization with heavy-tailed gradient noise characterizes the training of machine learning models more faithfully than the standard assumption of bounded gradient variance. Most existing works on this phenomenon focus on the convergence of optimization errors, while the analysis of generalization bounds under heavy-tailed gradient noise remains limited. In this paper, we develop a general framework for establishing generalization bounds under heavy-tailed noise. Specifically, we introduce a truncation argument to derive generalization error bounds from algorithmic stability under the assumption of a bounded $p$-th centered moment with $p\in(1,2]$. Building on this framework, we further provide stability and generalization analyses for several popular stochastic algorithms under heavy-tailed noise, including clipped and normalized stochastic gradient descent, as well as their mini-batch and momentum variants.
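To make the algorithms concrete, here is a minimal sketch (not the paper's own code) of the clipped and normalized SGD updates on a toy quadratic, with heavy-tailed noise drawn from a Pareto-type (Lomax) distribution whose tail index $a = 1.5$ gives a finite centered $p$-th moment only for $p < 1.5$, hence infinite variance. The step size, clipping threshold, and objective are illustrative assumptions.

```python
import numpy as np

def clip(g, c):
    """Clipped gradient: rescale g so its norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def normalize(g, eps=1e-12):
    """Normalized gradient: unit-norm direction of g."""
    return g / (np.linalg.norm(g) + eps)

rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])   # parameters for f(w) = 0.5 * ||w||^2, so grad f(w) = w
lr, c = 0.1, 1.0            # illustrative step size and clipping threshold
for _ in range(500):
    # Lomax(a=1.5) has mean 1/(a-1) = 2; subtracting 2 centers the noise.
    # Its variance is infinite, but E|noise|^p is finite for p < 1.5.
    noise = rng.pareto(1.5, size=2) - 2.0
    g = w + noise                    # stochastic gradient with heavy-tailed noise
    w = w - lr * clip(g, c)          # clipped-SGD step (use normalize(g) for normalized SGD)

print(np.linalg.norm(w))
```

Clipping caps the per-step displacement at `lr * c`, which is what makes the iterates controllable despite infinite-variance noise; normalized SGD instead fixes the step length at `lr` exactly.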