Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

📅 2023-12-13
🏛️ arXiv.org
📈 Citations: 24
Influential: 10
📄 PDF
🤖 AI Summary
This work addresses three fundamental gaps in the theoretical understanding of stochastic gradient descent (SGD): (1) the absence of optimal-rate convergence guarantees once the compact-domain and almost-surely-bounded-noise assumptions are dropped; (2) the scarcity of last-iterate analyses for smooth optimization; and (3) the lack of a unified framework covering composite objectives, non-Euclidean geometries, and heavy-tailed noise. We propose the first unified analysis framework accommodating general (possibly non-compact) domains, composite regularization, Bregman geometry, heavy-tailed stochastic noise, and general convexity/smoothness conditions. Our approach integrates generalized Bregman divergences, adaptive step sizes, and robust estimation techniques for heavy tails. We establish optimal $O\big(\sqrt{\log(1/\delta)/T}\big)$ convergence rates for the last iterate, both in expectation and with high probability $1-\delta$, thereby substantially broadening the theoretical applicability of SGD.
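To make the covered update concrete, below is a minimal sketch (not the paper's algorithm) of one step of stochastic composite mirror descent, the family of updates this framework analyzes: with the Euclidean Bregman geometry $h(x)=\|x\|^2/2$ and an $\ell_1$ regularizer, the Bregman proximal step reduces to soft-thresholding. The $1/\sqrt{t}$ step schedule, the noise model, and all names here are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1 under the Euclidean Bregman geometry h(x) = ||x||^2 / 2."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_md_step(x, stoch_grad, eta, lam):
    """One stochastic mirror-descent step on f(x) + lam * ||x||_1:
    x_{t+1} = argmin_y  eta * (<g_t, y> + lam * ||y||_1) + D_h(y, x_t)."""
    return soft_threshold(x - eta * stoch_grad, eta * lam)

# Toy run: least squares + l1 penalty with noisy gradients.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 10)), rng.normal(size=50)
x, T = np.zeros(10), 1000
for t in range(1, T + 1):
    g = A.T @ (A @ x - b) / len(b) + rng.normal(scale=0.1, size=10)  # stochastic gradient
    x = composite_md_step(x, g, eta=1.0 / np.sqrt(t), lam=0.05)      # eta_t ~ 1/sqrt(t), illustrative
# The last-iterate question: how good is this final x itself, with no averaging?
```

Classical analyses certify the averaged iterate; the point of last-iterate theory is a guarantee for this final `x` directly.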
📝 Abstract
In the past several years, the last-iterate convergence of the Stochastic Gradient Descent (SGD) algorithm has attracted considerable interest due to its good performance in practice but lack of theoretical understanding. For Lipschitz convex functions, different works have established the optimal $O(\log(1/\delta)\log T/\sqrt{T})$ or $O(\sqrt{\log(1/\delta)/T})$ high-probability convergence rates for the final iterate, where $T$ is the time horizon and $\delta$ is the failure probability. However, to prove these bounds, all existing works are either limited to compact domains or require almost surely bounded noise. It is natural to ask whether the last iterate of SGD can still guarantee the optimal convergence rate without these two restrictive assumptions. Beyond this important question, many theoretical problems remain open. For example, compared with the last-iterate convergence of SGD for non-smooth problems, only a few results for smooth optimization have been developed. Additionally, the existing results are all limited to non-composite objectives and the standard Euclidean norm. It remains unclear whether last-iterate convergence can be provably extended to the wider settings of composite optimization and non-Euclidean norms. In this work, to address the issues mentioned above, we revisit the last-iterate convergence of stochastic gradient methods and provide the first unified way to prove convergence rates both in expectation and in high probability while accommodating general domains, composite objectives, non-Euclidean norms, Lipschitz conditions, smoothness, and (strong) convexity simultaneously. Additionally, we extend our analysis to obtain last-iterate convergence under heavy-tailed noise.
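For the heavy-tailed extension mentioned at the end of the abstract, the standard device in this literature is gradient clipping; whether the paper uses exactly this operator, and the radius and step schedules below, are assumptions for illustration, not claims about the paper.

```python
import numpy as np

def clip(g, radius):
    """Rescale g into the Euclidean ball of the given radius."""
    norm = np.linalg.norm(g)
    return g if norm <= radius else g * (radius / norm)

def clipped_sgd(grad_oracle, x0, T, eta0=0.1, radius0=1.0):
    """SGD with clipped gradients; returns the LAST iterate, not an average."""
    x = x0.copy()
    for t in range(1, T + 1):
        g = clip(grad_oracle(x), radius0 * np.sqrt(t))  # growing radius (illustrative schedule)
        x -= (eta0 / np.sqrt(t)) * g
    return x

# Toy oracle: gradient of 0.5 * ||x||^2 corrupted by Student-t noise,
# whose variance is infinite for df = 2, i.e. genuinely heavy-tailed.
rng = np.random.default_rng(1)
oracle = lambda x: x + rng.standard_t(df=2, size=x.shape)
x_T = clipped_sgd(oracle, x0=np.ones(5), T=2000)
```

Clipping trades a small bias for bounded increments, which is what makes high-probability bounds possible when the noise has only a finite $p$-th moment, $p \in (1, 2]$.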
Problem

Research questions and friction points this paper is trying to address.

Analyzes last-iterate convergence of SGD without domain or noise restrictions (the previously known rates are recalled below)
Extends convergence theory to composite objectives and non-Euclidean norms
Provides a unified proof covering various function classes and noise conditions
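For reference, the two high-probability last-iterate rates quoted in the abstract, previously available only for compact domains or almost surely bounded noise, are

$$f(x_T) - f(x^\ast) = O\!\left(\frac{\log(1/\delta)\,\log T}{\sqrt{T}}\right) \quad \text{or} \quad f(x_T) - f(x^\ast) = O\!\left(\sqrt{\frac{\log(1/\delta)}{T}}\right) \quad \text{with probability } 1-\delta,$$

and the paper's aim is to recover the second, optimal rate without either restriction.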
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified convergence proof for general domains and composite objectives
Extends analysis to non-Euclidean norms and heavy-tailed noise
Simultaneously accommodates Lipschitz, smooth, and (strongly) convex conditions (a representative bound is sketched below)
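As one representative shape of the resulting guarantees (a sketch under standard assumptions the paper makes precise, e.g. stochastic gradients bounded by $G$ and initial distance $\|x_1 - x^\ast\| \le D$; exact constants and step-size schedules are the paper's and are not reproduced here):

$$\mathbb{E}\big[f(x_T) - f(x^\ast)\big] = O\!\left(\frac{GD}{\sqrt{T}}\right), \qquad f(x_T) - f(x^\ast) = O\!\left(GD\sqrt{\frac{\log(1/\delta)}{T}}\right) \ \text{with probability } 1-\delta.$$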