🤖 AI Summary
This paper establishes high-probability convergence guarantees for Clip-SGD under the joint assumptions that the objective is convex and $(L_0, L_1)$-smooth and that the stochastic gradients exhibit heavy-tailed (i.e., non-sub-Gaussian) noise. Prior work lacked a systematic analysis of the interplay between heavy-tailed noise and $(L_0, L_1)$-smoothness; this work closes that gap by deriving the first tight high-probability convergence bound for Clip-SGD that is free of exponential dependence on problem parameters. Methodologically, it combines a generalized analysis of gradient clipping, high-probability concentration inequalities tailored to heavy-tailed distributions, and a refined treatment of $(L_0, L_1)$-smoothness. The resulting bound recovers deterministic optimization and classical stochastic optimization ($L_1 = 0$) as special cases, substantially broadening the applicability and robustness guarantees of gradient clipping theory.
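For reference, the method and assumption named above are usually stated as follows; this is the standard textbook formulation (notation assumed here, not taken from the paper itself):

```latex
% Clip-SGD update with step size \gamma and clipping level \lambda,
% applied to a stochastic gradient estimate \hat\nabla f(x_t):
x_{t+1} = x_t - \gamma \,\mathrm{clip}\!\big(\hat\nabla f(x_t), \lambda\big),
\qquad
\mathrm{clip}(g, \lambda) = \min\!\Big(1, \tfrac{\lambda}{\|g\|}\Big)\, g.

% (L_0, L_1)-smoothness (generalized smoothness): the local curvature
% may grow with the gradient norm, recovering L-smoothness when L_1 = 0:
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|.
```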
📝 Abstract
Gradient clipping is a widely used technique in Machine Learning and Deep Learning (DL), known for its effectiveness in mitigating the impact of heavy-tailed noise, which frequently arises in the training of large language models. Additionally, first-order methods with clipping, such as Clip-SGD, exhibit stronger convergence guarantees than SGD under the $(L_0,L_1)$-smoothness assumption, a property observed in many DL tasks. However, the high-probability convergence of Clip-SGD under both assumptions -- heavy-tailed noise and $(L_0,L_1)$-smoothness -- has not been fully addressed in the literature. In this paper, we bridge this critical gap by establishing the first high-probability convergence bounds for Clip-SGD applied to convex $(L_0,L_1)$-smooth optimization with heavy-tailed noise. Our analysis extends prior results by recovering known bounds for the deterministic case and the stochastic setting with $L_1 = 0$ as special cases. Notably, our rates avoid exponentially large factors and do not rely on restrictive sub-Gaussian noise assumptions, significantly broadening the applicability of gradient clipping.
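To make the mechanism concrete, here is a minimal sketch of the Clip-SGD update described in the abstract. The test problem (a convex quadratic with Student-t gradient noise, whose tails are heavy, i.e. non-sub-Gaussian, yet have finite variance) is an illustrative assumption and is not taken from the paper; step size and clipping level are likewise arbitrary choices, not the paper's tuned parameters.

```python
import numpy as np

def clip(v, lam):
    # Clipping operator: rescale v so its Euclidean norm is at most lam.
    norm = np.linalg.norm(v)
    return v if norm <= lam else (lam / norm) * v

def clip_sgd(grad_oracle, x0, step_size, clip_level, n_steps, rng):
    # Clip-SGD: x_{t+1} = x_t - gamma * clip(g_t, lambda),
    # where g_t is a stochastic gradient estimate.
    x = x0.copy()
    for _ in range(n_steps):
        g = grad_oracle(x, rng)
        x = x - step_size * clip(g, clip_level)
    return x

# Illustrative problem (assumption, not from the paper):
# f(x) = 0.5 * ||x||^2, so grad f(x) = x, with additive Student-t
# noise (df=3): heavy-tailed but finite-variance, hence non-sub-Gaussian.
def noisy_grad(x, rng):
    return x + rng.standard_t(df=3, size=x.shape)

rng = np.random.default_rng(0)
x0 = np.full(10, 5.0)
x_final = clip_sgd(noisy_grad, x0, step_size=0.05, clip_level=2.0,
                   n_steps=2000, rng=rng)
print(np.linalg.norm(x_final))
```

Despite occasional extreme noise draws, the clipped step length never exceeds `step_size * clip_level`, which is exactly the robustness property the paper's high-probability analysis exploits.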