🤖 AI Summary
This paper studies the convergence of gradient descent (GD) with adaptive stepsizes for logistic regression on linearly separable data. The proposed method uses a risk-dependent large stepsize: the stepsize adapts to the current risk, scaled by a constant hyperparameter $\eta$ that can be arbitrarily large. Theoretically, after at most $1/\gamma^2$ burn-in iterations, where $\gamma$ denotes the margin of the dataset, the risk is upper bounded by $\exp(-\Theta(\eta))$; since $\eta$ is arbitrary, the risk can be made arbitrarily small immediately after the burn-in phase, even though the risk evolution may be non-monotonic along the way. The total iteration complexity matches a lower bound of $\Omega(1/\gamma^2)$ steps, constructed for hard datasets with margin $\gamma$, that applies to any batch or online first-order method, so GD with large adaptive stepsizes is minimax optimal among first-order batch methods. Notably, the classical Perceptron, a first-order online method, achieves the same $1/\gamma^2$ step complexity, matching GD even in constants. The analysis further extends to a broad class of loss functions and certain two-layer neural networks, unifying the theoretical treatment of GD and the Perceptron.
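The scheme described above can be sketched in a few lines. The stepsize rule below, $\eta_t = \eta / L(w_t)$, is one plausible instantiation of "a stepsize that adapts to the current risk, scaled by $\eta$" chosen purely for illustration; the paper's exact schedule may differ. The toy dataset and all function names are likewise hypothetical.

```python
import numpy as np

def logistic_risk(w, X, y):
    # Empirical logistic risk: mean of log(1 + exp(-y_i * <w, x_i>)).
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def adaptive_gd(X, y, eta=5.0, steps=20, tol=1e-12):
    # GD with a risk-adaptive stepsize eta_t = eta / L(w_t).
    # NOTE: this schedule is an assumption made for illustration;
    # it is not claimed to be the paper's exact stepsize rule.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        risk = np.mean(np.logaddexp(0.0, -margins))
        if risk < tol:  # data already separated to high accuracy
            break
        # Gradient of the logistic risk; clip margins to avoid overflow.
        probs = 1.0 / (1.0 + np.exp(np.clip(margins, -500.0, 500.0)))
        grad = -(X.T @ (y * probs)) / n
        w -= (eta / risk) * grad  # stepsize grows as the risk shrinks
    return w

# Linearly separable toy data with a positive margin.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = adaptive_gd(X, y)
```

Because the logistic gradient norm shrinks roughly in proportion to the risk, dividing $\eta$ by the current risk keeps the per-step progress roughly constant, which is why a few such steps can drive the risk far below what fixed-stepsize GD achieves in the same number of iterations.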
📝 Abstract
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk, scaled by a constant hyperparameter $\eta$. We show that after at most $1/\gamma^2$ burn-in steps, GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$, where $\gamma$ is the margin of the dataset. As $\eta$ can be arbitrarily large, GD attains an arbitrarily small risk $\textit{immediately after the burn-in steps}$, though the risk evolution may be $\textit{non-monotonic}$. We further construct hard datasets with margin $\gamma$, where any batch or online first-order method requires $\Omega(1/\gamma^2)$ steps to find a linear separator. Thus, GD with large, adaptive stepsizes is $\textit{minimax optimal}$ among first-order batch methods. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, matching GD even in constants. Finally, our GD analysis extends to a broad class of loss functions and certain two-layer networks.
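For comparison, the classical Perceptron referenced in the abstract admits a short sketch. Novikoff's theorem bounds the number of mistakes by $(R/\gamma)^2$, where $R = \max_i \|x_i\|$, which reduces to the $1/\gamma^2$ step complexity quoted above for normalized data. The toy dataset is hypothetical.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    # Classical online Perceptron: update only on mistakes.
    # Novikoff (1962): at most (R / gamma)^2 mistakes on separable data,
    # where R = max ||x_i|| and gamma is the geometric margin.
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for x, label in zip(X, y):
            if label * (x @ w) <= 0:  # mistake (or on the boundary)
                w += label * x        # move w toward the misclassified point
                mistakes += 1
                clean_pass = False
        if clean_pass:               # no mistakes: w separates the data
            break
    return w, mistakes

# Same separable toy data; the margin gamma is (2+1)/sqrt(2) for u = (1,1)/sqrt(2).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, mistakes = perceptron(X, y)
```

On this dataset $R^2/\gamma^2 = 5/4.5 \approx 1.1$, so Novikoff's bound permits at most one mistake, and the run above indeed separates the data after a single update.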