Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the gap in statistical efficiency, measured by sample complexity, between full-batch gradient descent (GD) and one-pass stochastic gradient descent (SGD) for learning a single-index model. By introducing a truncated quadratic activation, analyzing trajectory dynamics under a spherical constraint, and providing a refined characterization of both the correlation loss and the squared loss, the study establishes, for the first time in a nonlinear single-index setting, that full-batch GD achieves strong parameter recovery once the sample size satisfies \(n \gtrsim d\) and the number of iterations satisfies \(T \gtrsim \log d\). This breaks the logarithmic barrier inherent to one-pass SGD, which requires \(n \gtrsim d \log d\) samples. The findings demonstrate GD's advantage in both the weak-recovery (\(n \simeq d\)) and strong-recovery regimes, highlighting the critical roles of the optimization landscape and initialization in determining statistical efficiency.

📝 Abstract
It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
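To make the setup concrete, here is a minimal NumPy sketch of full-batch spherical GD on the (negative) correlation loss for a single-index model with a truncated quadratic activation. The truncation rule, step size, and problem sizes are illustrative assumptions for this sketch, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, lr, tau = 50, 500, 400, 0.05, 10.0  # illustrative sizes, not the paper's regimes

# Single-index data: y = sigma(<w_star, x>) with a truncated quadratic activation.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
sigma = lambda z: np.minimum(z**2, tau)            # one plausible truncation of z^2
dsigma = lambda z: np.where(z**2 < tau, 2 * z, 0.0)
y = sigma(X @ w_star)

# Full-batch GD on the correlation loss -(1/n) sum_i y_i * sigma(<w, x_i>),
# constrained to the unit sphere (project the gradient, then renormalize).
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(T):
    z = X @ w
    grad = -(X.T @ (y * dsigma(z))) / n            # Euclidean gradient of the loss
    grad -= (grad @ w) * w                         # project onto the tangent space
    w -= lr * grad
    w /= np.linalg.norm(w)                         # retract back to the sphere

# |<w, w_star>| near 1 indicates recovery; the quadratic activation is even,
# so w and -w are equivalent and the absolute value is the right measure.
overlap = abs(w @ w_star)
print(f"overlap = {overlap:.3f}")
```

One-pass SGD would replace the full-batch gradient with a single-sample gradient at a fresh data point each step; the paper's separation concerns how large n must be for each method to reach high overlap.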
Problem

Research questions and friction points this paper is trying to address.

sample complexity
full-batch gradient descent
one-pass SGD
single-index model
statistical efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

full-batch gradient descent
one-pass SGD
sample complexity separation
single-index model
activation truncation