🤖 AI Summary
This work addresses an open question in convex optimization under generalized smoothness, specifically $(L_0, L_1)$-smoothness: whether accelerated gradient methods can achieve the optimal complexity $O(\sqrt{\ell(0)}\, R / \sqrt{\varepsilon})$ for small error tolerances $\varepsilon$, where $R$ is the distance from the starting point to an optimum. We propose a novel first-order accelerated algorithm and develop a Lyapunov function framework tailored to generalized smoothness. For the first time, we rigorously attain this optimal complexity bound under $(L_0, L_1)$-smoothness, eliminating the exponential factors and extraneous dependencies present in prior approaches. Our analysis yields tight convergence rates and establishes a concise, scalable paradigm for designing and analyzing accelerated methods under broad smoothness assumptions. The result is theoretically optimal and significantly advances the understanding of acceleration beyond standard $L$-smoothness.
📝 Abstract
We study first-order methods for convex optimization problems with functions $f$ satisfying the recently proposed $\ell$-smoothness condition $\|\nabla^{2} f(x)\| \le \ell\left(\|\nabla f(x)\|\right)$, which generalizes both $L$-smoothness and $(L_{0},L_{1})$-smoothness. While accelerated gradient descent (AGD) is known to reach the optimal complexity $O(\sqrt{L}\, R / \sqrt{\varepsilon})$ under $L$-smoothness, where $\varepsilon$ is the error tolerance and $R$ is the distance between the starting point and an optimal point, existing extensions to $\ell$-smoothness either incur an extra dependence on the initial gradient, suffer exponential factors in $L_{1} R$, or require costly auxiliary sub-routines, leaving open whether an AGD-type $O(\sqrt{\ell(0)}\, R / \sqrt{\varepsilon})$ rate is achievable for small $\varepsilon$, even in the $(L_{0},L_{1})$-smoothness case.
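As a concrete instance (using the standard definition of $(L_0,L_1)$-smoothness from the literature, stated here for context rather than taken from this abstract), the affine choice $\ell(t) = L_0 + L_1 t$ recovers $(L_0,L_1)$-smoothness, while the constant choice $\ell(t) \equiv L$ recovers classical $L$-smoothness:

$$\|\nabla^{2} f(x)\| \;\le\; L_0 + L_1 \|\nabla f(x)\| \qquad \text{and} \qquad \|\nabla^{2} f(x)\| \;\le\; L.$$

In particular, $\ell(0) = L_0$ under the affine choice, which is the constant that appears in the target rate below.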
We resolve this open question. Leveraging a new Lyapunov function and designing new algorithms, we achieve $O(\sqrt{\ell(0)}\, R / \sqrt{\varepsilon})$ oracle complexity for small $\varepsilon$ and virtually any $\ell$. For instance, for $(L_{0},L_{1})$-smoothness, our bound $O(\sqrt{L_0}\, R / \sqrt{\varepsilon})$ is provably optimal in the small-$\varepsilon$ regime and removes all non-constant multiplicative factors present in prior accelerated algorithms.
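To spell out the specialization (a short sanity-check derivation under the affine choice $\ell(t) = L_0 + L_1 t$ above, not a restatement of the paper's proof): since $\ell(0) = L_0$,

$$O\!\left(\sqrt{\ell(0)}\, R / \sqrt{\varepsilon}\right) \;=\; O\!\left(\sqrt{L_0}\, R / \sqrt{\varepsilon}\right),$$

and because every $L_0$-smooth convex function is also $(L_0, L_1)$-smooth, the classical $\Omega(\sqrt{L_0}\, R / \sqrt{\varepsilon})$ lower bound for $L_0$-smooth convex optimization carries over to the broader class, which is why no first-order method can improve on this rate in the small-$\varepsilon$ regime.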