🤖 AI Summary
This work studies high-probability convergence of SGD-type algorithms for nonconvex optimization under heavy-tailed noise—potentially unbounded, asymmetric, and lacking bounded $p$-th moments for any $p \leq 2$. We propose a nonlinear SGD (N-SGD) framework based on noise symmetrization, integrating a symmetrized gradient estimator (SGE) and its minibatch variant (MSGE), coupled with nonlinear compression (e.g., sign, clipping, normalization, and their smooth counterparts) and refined tail-control techniques. We establish, for the first time under unbounded-moment and asymmetric heavy-tailed noise, an optimal $\widetilde{\mathcal{O}}(t^{-1/2})$ convergence rate with exponential high-probability guarantees. N-SGD and N-SGE achieve optimal oracle complexity, substantially outperforming existing methods—especially when $p < 2$. N-MSGE attains near-optimal performance for $1 < p \leq 2$.
📝 Abstract
We study high-probability convergence of SGD-type methods for non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities such as sign, clipping, normalization, and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves the rate $\widetilde{\mathcal{O}}(t^{-1/2})$ for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at a reference point is available at the start of training, while the second, dubbed Mini-batch SGE (MSGE), uses mini-batches to estimate the noiseless gradient. Combined with the nonlinear framework, we obtain the N-SGE and N-MSGE methods, respectively, both achieving the same convergence rate and exponentially decaying tails as N-SGD, while allowing for non-symmetric noise with unbounded moments and a PDF satisfying a mild technical condition, with N-MSGE additionally requiring a bounded noise moment of order $p \in (1,2]$. Compared to works assuming noise with bounded $p$-th moment, our results: 1) are based on a novel symmetrization approach; 2) provide a unified framework and relaxed moment conditions; 3) imply optimal oracle complexity of N-SGD and N-SGE, strictly better than existing works when $p < 2$, while the complexity of N-MSGE is close to existing works. Compared to works assuming symmetric noise with unbounded moments, we: 1) provide a sharper analysis and improved rates; 2) facilitate state-dependent symmetric noise; 3) extend the strong guarantees to non-symmetric noise.
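To make the black-box nonlinear framework concrete, here is a minimal illustrative sketch of nonlinear SGD with pluggable nonlinearities (sign, norm clipping, normalization). This is not the paper's exact algorithm or analysis setting: the quadratic objective, the $c/\sqrt{t+1}$ step-size schedule, and the Cauchy gradient-noise model (symmetric, heavy-tailed, with no finite moments of order $p \geq 1$) are our own choices for demonstration.

```python
import math
import random

# Three common nonlinearities subsumed by the black-box framework:
def clip(g, tau=1.0):
    """Norm clipping: scale g so its Euclidean norm is at most tau."""
    n = math.sqrt(sum(gi * gi for gi in g))
    s = min(1.0, tau / n) if n > 0 else 0.0
    return [s * gi for gi in g]

def sign(g):
    """Component-wise sign."""
    return [(gi > 0) - (gi < 0) for gi in g]

def normalize(g):
    """Scale g to unit Euclidean norm."""
    n = math.sqrt(sum(gi * gi for gi in g))
    return [gi / n for gi in g] if n > 0 else list(g)

def n_sgd(noisy_grad, x0, nonlinearity, steps=3000, c=0.5):
    """Nonlinear SGD: x_{t+1} = x_t - a_t * N(stochastic gradient)."""
    x = list(x0)
    for t in range(steps):
        a_t = c / math.sqrt(t + 1)  # decaying step size (our choice)
        g = nonlinearity(noisy_grad(x))
        x = [xi - a_t * gi for xi, gi in zip(x, g)]
    return x

# Demo: f(x) = 0.5 * ||x||^2, whose gradient is x, corrupted by
# standard Cauchy noise drawn via the inverse-CDF tan(pi * (u - 1/2)).
def noisy_grad(x):
    return [xi + math.tan(math.pi * (random.random() - 0.5)) for xi in x]

random.seed(0)
x_final = n_sgd(noisy_grad, [5.0, -5.0], clip)
```

Swapping `clip` for `sign` or `normalize` exercises the other nonlinearities; under this heavy-tailed noise, plain (linear) SGD with the same step sizes would be destabilized by the occasional enormous gradient sample, which is precisely what the bounded-output nonlinearity prevents.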