Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether adaptive gradient methods converge under heavy-tailed noise, focusing on AdaGrad in non-convex optimization with no prior knowledge of the noise characteristics. For gradient noise with tail index $4/3 < p \leq 2$, the paper gives the first provable convergence rate for AdaGrad; the rate holds without knowing $p$, so the method adapts to the tail index automatically. It also derives an algorithm-dependent lower bound suggesting that AdaGrad cannot attain the existing minimax rate for heavy-tailed optimization. Finally, the analysis is extended to AdaGrad-Norm, which achieves an improved rate for any $1 < p \leq 2$ under an extra mild assumption. Together, these results give a theoretical basis for the observed robustness of adaptive methods under heavy-tailed noise.

📝 Abstract

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.
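To make the two algorithms named in the abstract concrete, here is a minimal NumPy sketch of their commonly used textbook update rules. It is an illustration under stated assumptions, not the paper's exact formulation: the step size `eta`, the stabilizer `eps`, the toy objective in `grad_f`, and the choice of Student-t noise are all hypothetical. Student-t noise with `df` degrees of freedom has finite moments only of order below `df`, which is one simple way to imitate noise with tail index $p < 2$.

```python
import numpy as np


def heavy_tailed_noise(rng, dim, df=1.5):
    # Assumption: Student-t noise stands in for heavy-tailed gradient noise.
    # For 1 < df <= 2 it has zero mean but infinite variance.
    return rng.standard_t(df, size=dim)


def grad_f(x):
    # Gradient of a toy smooth non-convex objective f(x) = sum(x^2 / (1 + x^2)).
    return 2.0 * x / (1.0 + x**2) ** 2


def adagrad_step(x, g, v, eta=0.1, eps=1e-8):
    # Coordinate-wise AdaGrad: each coordinate accumulates its own squared gradients.
    v = v + g * g
    x = x - eta * g / (np.sqrt(v) + eps)
    return x, v


def adagrad_norm_step(x, g, b2, eta=0.1, eps=1e-8):
    # AdaGrad-Norm: one scalar accumulator of squared gradient norms for all coordinates.
    b2 = b2 + float(np.dot(g, g))
    x = x - eta * g / (np.sqrt(b2) + eps)
    return x, b2


def run(step, state0, steps=1000, dim=5, seed=0):
    # Run one of the two update rules on noisy gradients of the toy objective.
    rng = np.random.default_rng(seed)
    x = np.ones(dim)
    state = state0
    for _ in range(steps):
        g = grad_f(x) + heavy_tailed_noise(rng, dim)
        x, state = step(x, g, state)
    return x
```

Calling `run(adagrad_step, np.zeros(5))` or `run(adagrad_norm_step, 0.0)` runs either method with no clipping or normalization, and neither update rule takes the tail index as an input, which is the setting the abstract asks about. The sketch says nothing about convergence rates.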

Problem

Research questions and friction points this paper is trying to address.

adaptive gradient methods

heavy-tailed noise

convergence

AdaGrad

non-convex optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

AdaGrad

heavy-tailed noise

adaptive convergence