🤖 AI Summary
This paper investigates the convergence of gradient descent (GD) on separable data under Fenchel–Young losses with *arbitrary* step sizes, removing classical assumptions of small steps or self-boundedness (e.g., exponential tails in logistic regression). The authors develop the first unified analytical framework, establishing that the *separation margin*, rather than self-boundedness, is the essential structural property ensuring convergence for any step size. They prove that most Fenchel–Young losses satisfy this margin condition. Specifically, GD with the Tsallis entropy loss achieves an $O(\varepsilon^{-1/2})$ convergence rate to $\varepsilon$-suboptimality, while the Rényi entropy loss yields the currently best-known rate of $O(\varepsilon^{-1/3})$. The analysis integrates classical perceptron-style arguments, convex conjugacy theory, and a precise geometric characterization of the separation boundary.
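As a toy illustration of the phenomenon being summarized (this sketch is our own construction on synthetic data, not the paper's analysis; all constants and names are assumptions): GD on logistic regression over a linearly separable dataset can still drive the loss down even when the constant step size is chosen well above the classical $2/L$ smoothness-based stability range.

```python
import numpy as np

# Toy sketch (synthetic Gaussian data, our own constants; NOT the
# paper's construction): run GD on logistic regression over linearly
# separable data with a deliberately large constant step size.

rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
w_star = np.array([2.0, -1.0])
y = np.sign(X @ w_star)            # labels in {-1, +1}; separable by w_star

def loss(w):
    # mean logistic loss, computed stably via log(1 + e^{-z}) = logaddexp(0, -z)
    return np.logaddexp(0.0, -y * (X @ w)).mean()

def grad(w):
    z = y * (X @ w)
    s = 0.5 * (1.0 - np.tanh(z / 2.0))   # sigmoid(-z), written stably
    return -(X * (s * y)[:, None]).mean(axis=0)

w = np.zeros(d)
eta = 10.0                          # well above the classical stable range here
history = [loss(w)]
for _ in range(2000):
    w -= eta * grad(w)
    history.append(loss(w))

print(history[0], history[-1])      # loss ends far below its initial value log(2)
```

Shrinking `eta` into the classical stable range tames any early oscillation, but, per the result summarized above, that is not required for eventual convergence on separable data.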
📝 Abstract
Gradient descent (GD) is one of the most common optimizers in machine learning. In particular, the loss landscape of a neural network typically sharpens during the initial phase of training, making the training dynamics hover at the edge of stability. This lies beyond our standard understanding of GD convergence in the stable regime, where the stepsize is chosen sufficiently smaller than the edge-of-stability threshold. Recently, Wu et al. (COLT 2024) showed that GD converges with an arbitrary stepsize for linearly separable logistic regression. Although their analysis hinges on the self-bounding property of the logistic loss, which appears to be the cornerstone of their modified descent lemma, our pilot study shows that other loss functions without the self-bounding property can also make GD converge with an arbitrary stepsize. To further understand which property of a loss function matters for GD, we aim to show arbitrary-stepsize GD convergence for general loss functions within the framework of *Fenchel–Young losses*. We essentially leverage the classical perceptron argument to derive the convergence rate for achieving an $\epsilon$-optimal loss, which is possible for a majority of Fenchel–Young losses. Among typical loss functions, the Tsallis entropy achieves the GD convergence rate $T=\Omega(\epsilon^{-1/2})$, and the Rényi entropy achieves the far better rate $T=\Omega(\epsilon^{-1/3})$. We argue that these better rates are possible because of the *separation margin* of loss functions, rather than the self-bounding property.
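For readers unfamiliar with the framework (standard background following Blondel et al., 2020, not part of this abstract): the Fenchel–Young loss generated by a convex regularizer $\Omega$ is

$$
L_\Omega(\theta; y) \;=\; \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle,
$$

which is nonnegative by the Fenchel–Young inequality and vanishes exactly when $y \in \partial \Omega^*(\theta)$. Taking $\Omega$ to be the negative Shannon entropy on the probability simplex recovers the logistic (cross-entropy) loss, while Tsallis and Rényi entropies generate the losses whose rates are discussed above.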