Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes

📅 2025-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies the convergence of gradient descent (GD) with adaptive stepsizes for logistic regression on linearly separable data. The stepsize adapts to the current risk and is scaled by a constant hyperparameter $\eta$, which can be arbitrarily large. Theoretically, after at most $1/\gamma^2$ burn-in iterations, where $\gamma$ denotes the margin of the dataset, the risk is bounded by $\exp(-\Theta(\eta))$. Since $\eta$ is arbitrary, GD attains arbitrarily high accuracy immediately after the burn-in phase, even though the risk may evolve non-monotonically. The total iteration complexity matches a minimax lower bound of $\Omega(1/\gamma^2)$, which the authors establish by constructing hard datasets on which any first-order batch or online method needs that many steps to find a linear separator; hence GD with large, adaptive stepsizes is minimax optimal among first-order batch methods. Notably, the classical Perceptron (Novikoff, 1962) also achieves a step complexity of $1/\gamma^2$, matching GD even in constants. The analysis further extends to a broad class of loss functions and certain two-layer neural networks.
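The risk-dependent stepsize rule can be sketched in a few lines. The sketch below assumes the stepsize takes the form $\eta / R(w_t)$ (dividing by the current risk so that updates stay large as the risk shrinks); the exact scaling, the toy dataset, and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def logistic_risk(w, X, y):
    # empirical risk: mean of log(1 + exp(-y_i <w, x_i>)), computed stably
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def logistic_grad(w, X, y):
    # gradient of the risk; sigmoid(-m) computed stably via logaddexp
    m = y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, m))  # = 1 / (1 + exp(m))
    return -(X.T @ (s * y)) / len(y)

def adaptive_gd(X, y, eta=10.0, steps=30):
    # GD with risk-dependent stepsize eta / R(w_t) (assumed form of the
    # adaptive rule); with large eta the risk may be non-monotonic before
    # settling at roughly exp(-Theta(eta))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        r = logistic_risk(w, X, y)
        if r < 1e-12:  # risk already tiny; avoid dividing by ~0
            break
        w -= (eta / r) * logistic_grad(w, X, y)
    return w

# toy linearly separable dataset (illustrative, not from the paper)
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = adaptive_gd(X, y)
```

On this toy problem the returned iterate separates the data and drives the risk far below its initial value of $\log 2$ within a handful of steps.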

📝 Abstract
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $\eta$. We show that after at most $1/\gamma^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$, where $\gamma$ is the margin of the dataset. As $\eta$ can be arbitrarily large, GD attains an arbitrarily small risk $\textit{immediately after the burn-in steps}$, though the risk evolution may be $\textit{non-monotonic}$. We further construct hard datasets with margin $\gamma$, where any batch or online first-order method requires $\Omega(1/\gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $\textit{minimax optimal}$ among first-order batch methods. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
Problem

Research questions and friction points this paper is trying to address.

Analyzing gradient descent convergence in logistic regression with adaptive stepsizes
Proving minimax optimality of GD among first-order batch methods
Extending analysis to various loss functions and two-layer networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive stepsizes scaled by hyperparameter η
Minimax optimal among first-order batch methods
Extends to various loss functions and networks
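The perceptron comparison above can be made concrete: by Novikoff's classical bound, the Perceptron makes at most $(R/\gamma)^2$ mistakes on separable data with margin $\gamma$ and radius $R = \max_i \|x_i\|$, mirroring the $1/\gamma^2$ step complexity of GD. A minimal sketch follows; the toy dataset and its margin are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    # classical Perceptron (Novikoff, 1962): cycle through the data and
    # update on every mistake; for separable data the total mistake
    # count is at most (R / gamma)^2
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        made_mistake = False
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:
                w = w + yi * xi
                mistakes += 1
                made_mistake = True
        if not made_mistake:  # a clean pass: w separates the data
            break
    return w, mistakes

# toy separable dataset with R = 2 and margin gamma = 1 (illustrative)
X = np.array([[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, mistakes = perceptron(X, y)
```

Here the mistake count stays within the Novikoff bound of $(R/\gamma)^2 = 4$, and the returned $w$ separates the data.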