Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the gap in statistical efficiency, measured by sample complexity, between full-batch gradient descent (GD) and one-pass stochastic gradient descent (SGD) for learning a single-index model. By introducing a truncated quadratic activation, analyzing trajectory dynamics under a spherical constraint, and providing a refined characterization of both the correlation loss and the squared loss, the study establishes, for the first time in a nonlinear single-index setting, that full-batch GD achieves strong parameter recovery once the sample size satisfies \(n \gtrsim d\) and the number of iterations satisfies \(T \gtrsim \log d\). This breaks the logarithmic barrier inherent to one-pass SGD, which requires \(n \gtrsim d \log d\) samples. The findings demonstrate GD's advantage in both the weak-recovery (\(n \simeq d\)) and strong-recovery regimes, highlighting the critical roles of the optimization landscape and initialization in determining statistical efficiency.

📝 Abstract
It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
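To make the setup concrete, here is a minimal NumPy sketch of full-batch spherical GD on the (negative) correlation loss for a single-index model with a truncated quadratic activation. The truncation rule, step size, and problem sizes are illustrative assumptions for this sketch, not the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T, lr, tau = 50, 500, 400, 0.05, 10.0  # illustrative sizes, not the paper's regimes

# Single-index data: y = sigma(<w_star, x>) with a truncated quadratic activation.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
sigma = lambda z: np.minimum(z**2, tau)            # one plausible truncation of z^2
dsigma = lambda z: np.where(z**2 < tau, 2 * z, 0.0)
y = sigma(X @ w_star)

# Full-batch GD on the correlation loss -(1/n) sum_i y_i * sigma(<w, x_i>),
# constrained to the unit sphere (project the gradient, then renormalize).
w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for _ in range(T):
    z = X @ w
    grad = -(X.T @ (y * dsigma(z))) / n            # Euclidean gradient of the loss
    grad -= (grad @ w) * w                         # project onto the tangent space
    w -= lr * grad
    w /= np.linalg.norm(w)                         # retract back to the sphere

# |<w, w_star>| near 1 indicates recovery; the quadratic activation is even,
# so w and -w are equivalent and the absolute value is the right measure.
overlap = abs(w @ w_star)
print(f"overlap = {overlap:.3f}")
```

One-pass SGD would replace the full-batch gradient with a single-sample gradient at a fresh data point each step; the paper's separation concerns how large n must be for each method to reach high overlap.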
Problem

Research questions and friction points this paper is trying to address.

sample complexity
full-batch gradient descent
one-pass SGD
single-index model
statistical efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

full-batch gradient descent
one-pass SGD
sample complexity separation
single-index model
activation truncation