Gradient Descent's Last Iterate is Often (slightly) Suboptimal

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks whether gradient descent (GD) and its stochastic variant (SGD) can attain the optimal last-iterate convergence rate when the total number of iterations T is not fixed in advance, that is, under an anytime guarantee. For convex Lipschitz minimization, the authors prove that no stepsize sequence chosen without knowledge of T can avoid an excess poly-logarithmic factor in T in the last-iterate error, even for noiseless GD. This confirms a conjecture of Jain et al. [2019], who had shown that the optimal 1/√T rate is attainable when T is known beforehand. The proof further suggests that (slightly) suboptimal stopping times are unavoidably common.

📝 Abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.
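
To make the distinction between an anytime stepsize schedule and a horizon-dependent one concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's construction: the objective f(x) = |x| (convex and 1-Lipschitz), the starting point 0.3, the constant c = 1, and the function names. The horizon-dependent schedule shown is the plain constant step c/√T, not the non-standard sequence of Jain et al. This one-dimensional toy does not exhibit the excess poly-log factor; it only shows what "choosing T in advance" means operationally.

```python
import math


def f(x):
    """Toy objective: f(x) = |x|, convex and 1-Lipschitz, minimized at x = 0."""
    return abs(x)


def subgrad(x):
    """A subgradient of |x| (0 is a valid choice at x = 0)."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def anytime_step(t, c=1.0):
    """Standard anytime schedule eta_t = c / sqrt(t); needs no knowledge of T."""
    return c / math.sqrt(t)


def make_fixed_horizon_step(T, c=1.0):
    """Horizon-dependent schedule: constant step c / sqrt(T), so T must be chosen up front."""
    eta = c / math.sqrt(T)
    return lambda t: eta


def run_gd(x0, num_steps, step_size):
    """Run subgradient descent and return all iterates, starting with x0."""
    x = x0
    iterates = [x]
    for t in range(1, num_steps + 1):
        x = x - step_size(t) * subgrad(x)
        iterates.append(x)
    return iterates


T = 1000
anytime_run = run_gd(0.3, T, anytime_step)
fixed_run = run_gd(0.3, T, make_fixed_horizon_step(T))

print("anytime schedule, last-iterate error:", f(anytime_run[-1]))
print("fixed-horizon schedule, last-iterate error:", f(fixed_run[-1]))
```

With the anytime schedule, stopping at step 500 yields exactly the first 501 entries of the T = 1000 run, so a last-iterate guarantee has to hold at every prefix simultaneously. With the horizon-dependent schedule, changing T changes every step, including the first one.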

Problem

Research questions and friction points this paper is trying to address.

gradient descent

last iterate convergence

stepsize schedule

anytime guarantee

convergence rate

Innovation

Methods, ideas, or system contributions that make the work stand out.

last iterate convergence

gradient descent

stochastic gradient descent