Clipping Improves Adam-Norm and AdaGrad-Norm when the Noise Is Heavy-Tailed

📅 2024-06-06
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the failure of high-probability convergence for AdaGrad- and Adam-type adaptive optimizers under heavy-tailed stochastic gradient noise. We first identify the mechanism behind their severely degraded convergence rates, without gradient clipping, under such noise. To restore robustness, we study clipped variants of AdaGrad-Norm and Adam-Norm, in both delayed and non-delayed forms. We establish high-probability convergence bounds with only polylogarithmic dependence on the confidence level, proving robust convergence for both smooth convex and non-convex objectives. Experiments confirm that gradient clipping significantly improves stability and convergence speed under heavy-tailed noise, outperforming the original unclipped methods. Our core contribution is twofold: (i) the first systematic characterization of the high-probability convergence breakdown of adaptive optimizers under heavy-tailed noise; and (ii) a theoretically grounded remedy via gradient clipping that provably recovers robustness.

📝 Abstract
Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence for such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed versions) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad-Norm and Adam-Norm with clipping, with and without delay, for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of the clipped versions of AdaGrad/Adam-Norm in handling heavy-tailed noise.
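
To make the update concrete, here is a generic sketch of the kind of step the abstract refers to: a clipped stochastic gradient fed into the AdaGrad-Norm stepsize. The notation (base stepsize \(\gamma\), clipping level \(\lambda\), accumulator \(b_t\)) is illustrative, and the paper's exact normalization, clipping schedule, and Adam-Norm variant may differ:

\[
\mathrm{clip}(g, \lambda) = \min\!\left(1, \tfrac{\lambda}{\|g\|}\right) g,
\qquad
g_t = \mathrm{clip}\!\left(\nabla f_{\xi_t}(x_t), \lambda\right),
\]
\[
b_t^2 = b_{t-1}^2 + \|g_t\|^2,
\qquad
x_{t+1} = x_t - \frac{\gamma}{b_t}\, g_t
\quad \text{(the delayed variant divides by } b_{t-1} \text{ instead of } b_t\text{).}
\]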
Problem

Research questions and friction points this paper is trying to address.

High-probability convergence of AdaGrad/Adam-type methods is poorly understood when the stochastic gradient noise is heavy-tailed.
Does gradient clipping restore good high-probability convergence for AdaGrad/Adam under heavy-tailed noise?
Do clipped AdaGrad/Adam-Norm variants outperform their unclipped counterparts empirically?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient clipping provably improves the convergence of AdaGrad/Adam-type methods.
Clipping makes AdaGrad-Norm and Adam-Norm robust to heavy-tailed gradient noise, with and without delayed stepsizes.
New high-probability bounds with only polylogarithmic dependence on the confidence level for smooth convex and non-convex problems (see the sketch after this list).
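
As a rough illustration of the method these bullets point to, below is a minimal, non-authoritative Python sketch of one clipped AdaGrad-Norm step with an optional delayed denominator. The function names, the fixed clipping level lam, the accumulator initialization b_0^2, and the toy noise model are assumptions for illustration, not the paper's exact algorithm or tuning.

```python
import numpy as np

def clip(g, lam):
    # Scale g down so that its norm is at most lam; otherwise leave it unchanged.
    norm = np.linalg.norm(g)
    return g if norm <= lam else g * (lam / norm)

def clipped_adagrad_norm_step(x, b_sq, grad_fn, gamma=1.0, lam=1.0, delayed=False):
    # One AdaGrad-Norm step applied to a clipped stochastic gradient.
    #   x       : current iterate
    #   b_sq    : accumulated sum of squared clipped-gradient norms (b_{t-1}^2)
    #   grad_fn : callable returning a stochastic gradient at x
    #   delayed : if True, normalize by the previous accumulator b_{t-1},
    #             so the stepsize does not depend on the current gradient
    g = clip(grad_fn(x), lam)
    b_sq_new = b_sq + np.linalg.norm(g) ** 2
    denom = np.sqrt(b_sq if delayed else b_sq_new)
    return x - gamma * g / denom, b_sq_new

# Toy usage: quadratic objective with heavy-tailed (Student-t) gradient noise.
rng = np.random.default_rng(0)
x, b_sq = np.ones(10), 1.0  # b_0^2 = 1 keeps the first delayed step bounded by gamma * lam
noisy_grad = lambda z: z + rng.standard_t(1.5, size=z.shape)  # df < 2: infinite variance
for _ in range(1000):
    x, b_sq = clipped_adagrad_norm_step(x, b_sq, noisy_grad, gamma=1.0, lam=1.0, delayed=True)
```

One common motivation for the delayed denominator is that the stepsize then does not depend on the current stochastic gradient, which typically simplifies the high-probability analysis.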
Authors

S. Chezhegov
MIPT, ISP RAS

Yaroslav Klyukin
MIPT

Andrei Semenov
MIPT

Aleksandr Beznosikov
PhD, Basic Research of Artificial Intelligence Lab
Optimization, Machine Learning

A. Gasnikov
Innopolis University, MIPT, Skoltech

Samuel Horváth
MBZUAI

Martin Takáč
MBZUAI

Eduard Gorbunov
Assistant Professor, Mohamed bin Zayed University of Artificial Intelligence
Optimization, Machine Learning, Federated Learning, Variational Inequalities