🤖 AI Summary
This work addresses a limitation of existing nonsmooth optimization theory, which typically assumes uniformly bounded gradients and therefore excludes many practical problems. The authors introduce a more realistic generalized Lipschitz condition, in which the gradient norm is bounded by an affine function of the optimality gap, and analyze the convergence of several stochastic optimization algorithms for convex problems under this condition. A key contribution is the theoretical result that AdamW with clipped updates achieves better convergence rates than SGD and AdaGrad in this setting, with its exponentially weighted gradient accumulation (as opposed to simple averaging) playing a critical role. The analysis further covers generalized smoothness, where clipped AdamW attains improved rates, as well as diagonal and matrix preconditioners and the quasar-convex setting.
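As a hedged illustration, the generalized Lipschitz condition described above can be written as follows. The constants $L_0, L_1 \ge 0$ and the optimal value $f^\star$ are notation assumed here for readability, not taken from the source:

$$\|\nabla f(x)\| \le L_0 + L_1 \bigl(f(x) - f^\star\bigr) \quad \text{for all } x.$$

Setting $L_1 = 0$ recovers the classical uniformly bounded gradient assumption, so the condition is a strict relaxation of it.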
📝 Abstract
Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates theoretically outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
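To make the ingredients named in the abstract concrete, here is a minimal Python sketch of an AdamW-style step whose preconditioned update is norm-clipped. It is an illustration under stated assumptions, not the authors' exact algorithm: the function names, the hyperparameter values, the choice to clip the update by its Euclidean norm, and the test objective are all hypothetical. The objective f(x) = Σ (2 cosh(x_i) − 2) is chosen because its gradient is unbounded yet satisfies ‖∇f(x)‖ ≤ f(x) + 2d in dimension d, so it fits the affine-in-optimality-gap bound with f* = 0.

```python
import math

def objective(x):
    # f(x) = sum_i (2*cosh(x_i) - 2); minimum value 0 at x = 0.
    return sum(2.0 * math.cosh(xi) - 2.0 for xi in x)

def gradient(x):
    # Gradient 2*sinh(x_i): unbounded, but its norm is at most f(x) + 2*len(x).
    return [2.0 * math.sinh(xi) for xi in x]

def clipped_adamw_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
                       eps=1e-8, weight_decay=0.0, clip=1.0):
    """One AdamW-style step with a norm-clipped update (illustrative only)."""
    # Exponentially weighted accumulation of gradients and squared gradients.
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Standard Adam bias correction (t is the 1-based step counter).
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Diagonally preconditioned direction.
    u = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    # Assumption: clip the update so its Euclidean norm is at most `clip`.
    norm = math.sqrt(sum(ui * ui for ui in u))
    scale = min(1.0, clip / norm) if norm > 0.0 else 1.0
    # Decoupled weight decay, as in AdamW.
    x = [xi - lr * (scale * ui + weight_decay * xi) for xi, ui in zip(x, u)]
    return x, m, v

# Hypothetical starting point and step budget.
x0 = [3.0, -2.0]
x_final = list(x0)
m = [0.0] * len(x0)
v = [0.0] * len(x0)
for t in range(1, 201):
    x_final, m, v = clipped_adamw_step(x_final, gradient(x_final), m, v, t)
```

Comparing `objective(x_final)` with `objective(x0)` shows whether the run reduced the optimality gap. The sketch uses exact gradients for simplicity; replacing `gradient` with a noisy estimate would correspond to the stochastic setting the abstract studies.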