🤖 AI Summary
We address convex optimization problems in machine learning that are nonsmooth in the classical sense yet satisfy $(L_0,L_1)$-smoothness, a structural generalization of classical $C^{1,1}$ smoothness. We derive new guarantees for existing methods and propose a new algorithm, dispensing with the standard $C^{1,1}$ assumption and achieving convergence rates free of exponential dependence on the initial distance to the optimum. Methodologically, we (i) establish the first tight deterministic and stochastic convergence bounds for gradient clipping and the Polyak stepsize method, eliminating the exponential dependence on initialization; (ii) design the first Nesterov-type accelerated algorithm for convex $(L_0,L_1)$-smooth functions, with an extension to stochastic over-parameterized settings; and (iii) derive new convergence rates for Adaptive Gradient Descent in the Malitsky–Mishchenko framework, enabling fully adaptive stepsize selection. All results hold for both strongly convex and general convex objectives, with rigorous theoretical guarantees, and significantly improve upon the state of the art under this generalized smoothness assumption.
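For context, the $(L_0,L_1)$-smoothness condition of Zhang et al. (2020) is commonly stated, for a twice-differentiable function $f$, as a gradient-dependent bound on the Hessian norm (the exact formulation used in the paper may differ slightly, e.g. a first-order variant):

$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x,$$

which recovers classical $L$-smoothness when $L_1 = 0$ and $L_0 = L$, while also covering functions such as polynomials of degree higher than two and $x \mapsto e^x$, whose curvature grows with the gradient.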
📝 Abstract
Due to the non-smoothness of optimization problems in Machine Learning, generalized smoothness assumptions have attracted considerable attention in recent years. One of the most popular assumptions of this type is $(L_0,L_1)$-smoothness (Zhang et al., 2020). In this paper, we focus on the class of (strongly) convex $(L_0,L_1)$-smooth functions and derive new convergence guarantees for several existing methods. In particular, we derive improved convergence rates for Gradient Descent with (Smoothed) Gradient Clipping and for Gradient Descent with Polyak Stepsizes. In contrast to the existing results, our rates do not rely on the standard smoothness assumption and do not suffer from an exponential dependence on the initial distance to the solution. We also extend these results to the stochastic case under the over-parameterization assumption, propose a new accelerated method for convex $(L_0,L_1)$-smooth optimization, and derive new convergence rates for Adaptive Gradient Descent (Malitsky and Mishchenko, 2020).
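To make the two baseline methods concrete, here is a minimal sketch of gradient descent with (hard) gradient clipping and with the Polyak stepsize, in their standard textbook forms on a toy quadratic. These are generic illustrations, not the paper's exact (smoothed) variants or rates; all function and parameter names are ours.

```python
import numpy as np

def gd_clipped(grad, x0, eta=0.1, c=1.0, iters=200):
    """Gradient descent with gradient clipping: the effective stepsize
    is min(eta, c / ||g||), so very large gradients are scaled down."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        step = min(eta, c / (np.linalg.norm(g) + 1e-12))
        x = x - step * g
    return x

def gd_polyak(f, grad, x0, f_star=0.0, iters=200):
    """Gradient descent with the Polyak stepsize
    gamma_k = (f(x_k) - f*) / ||grad f(x_k)||^2, which requires
    knowing the optimal value f* (here 0 for the toy problem)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = grad(x)
        gnorm2 = float(g @ g)
        if gnorm2 < 1e-16:  # gradient vanished: stop
            break
        x = x - (f(x) - f_star) / gnorm2 * g
    return x

# Toy convex objective f(x) = ||x||^2 / 2, minimized at the origin.
f = lambda x: 0.5 * float(x @ x)
grad = lambda x: x

x_clip = gd_clipped(grad, [5.0, -3.0])
x_polyak = gd_polyak(f, grad, [5.0, -3.0])
```

On this quadratic the Polyak stepsize evaluates to a constant 1/2, and clipping is inactive once the gradient is small, so both runs contract geometrically toward the origin.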