Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation

📅 2026-05-11
📈 Citations: 0 (influential: 0)

🤖 AI Summary
This work studies the regularization effect that data augmentation induces in supervised regression, and its impact on generalization error, in the high-dimensional proportional regime where the covariate dimension and the sample size grow at the same rate. Using random feature regression, high-dimensional statistical analysis, and spectral methods, it gives a sharp asymptotic characterization of the test mean squared error that depends only on population quantities of the true data and on the first- and second-order statistics of the augmentation scheme. The characterization holds under misspecified feature maps and for any network architecture in which only the final readout layer is trained. The results are specialized to Gaussian data, where the characterization is shown to be tight, and they quantify how data augmentation acts as a regularizer on the learned readout.
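A short identity helps explain why only the first- and second-order statistics of the augmentation can enter the squared loss of a linear readout. The notation below is assumed for illustration and does not come from the paper: $\varphi$ is the (frozen) feature map, $g$ a random augmentation that leaves the label $y$ unchanged, and $a$ the readout weights.

```latex
\mathbb{E}_{g}\!\left[\big(y - a^\top \varphi(g(x))\big)^2\right]
  = \big(y - a^\top \mu(x)\big)^2 + a^\top \Sigma(x)\, a,
\qquad
\mu(x) = \mathbb{E}_{g}\big[\varphi(g(x))\big],
\quad
\Sigma(x) = \operatorname{Cov}_{g}\big[\varphi(g(x))\big].
```

The first term is the loss on the mean augmented feature, and the second is a data-dependent quadratic penalty on $a$, which is the sense in which augmentation can behave like a ridge-type regularizer. Whether the paper's analysis proceeds through this exact decomposition is not stated in the summary.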
📝 Abstract
This paper aims at analyzing the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, in terms only of the population quantities of the true data, as well as first and second order statistics of the augmentation scheme. Our results are valid under misspecified feature maps, and for any network architecture where only the last readout layer is trained, and the rest of the network is either frozen or randomly initialized. We specify our results in the case of Gaussian data, and show that our asymptotic characterization is tight in this setting.
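The setting in the abstract can be explored numerically with a small simulation. The following is a minimal sketch, not the authors' experiment: it assumes isotropic Gaussian covariates, a linear teacher, a frozen random ReLU layer, additive Gaussian noise as the augmentation, and a small ridge term for numerical stability. All function names and parameter values are hypothetical.

```python
import numpy as np


def features(X, W):
    """Frozen random first layer with a ReLU; W is never trained."""
    d = X.shape[1]
    return np.maximum(X @ W.T / np.sqrt(d), 0.0)


def fit_readout(Phi, y, ridge):
    """Train only the last (readout) layer by ridge-regularized least squares."""
    n, p = Phi.shape
    A = Phi.T @ Phi / n + ridge * np.eye(p)
    return np.linalg.solve(A, Phi.T @ y / n)


def run_experiment(n=300, d=150, p=200, n_aug=5, aug_std=0.5,
                   ridge=1e-3, seed=0):
    """Return (test MSE without augmentation, test MSE with augmentation)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, d))              # frozen random weights
    beta = rng.standard_normal(d) / np.sqrt(d)   # linear teacher (assumption)

    X = rng.standard_normal((n, d))              # Gaussian covariates
    y = X @ beta + 0.1 * rng.standard_normal(n)  # noisy labels
    X_test = rng.standard_normal((2000, d))
    y_test = X_test @ beta

    # Baseline: readout trained on the original samples only.
    a_plain = fit_readout(features(X, W), y, ridge)

    # Augmentation (assumption): n_aug noisy copies per sample, labels kept.
    X_aug = np.repeat(X, n_aug, axis=0)
    X_aug = X_aug + aug_std * rng.standard_normal(X_aug.shape)
    y_aug = np.repeat(y, n_aug)
    a_aug = fit_readout(features(X_aug, W), y_aug, ridge)

    Phi_test = features(X_test, W)

    def test_mse(a):
        return float(np.mean((y_test - Phi_test @ a) ** 2))

    return test_mse(a_plain), test_mse(a_aug)


if __name__ == "__main__":
    mse_plain, mse_aug = run_experiment()
    print(f"test MSE without augmentation: {mse_plain:.4f}")
    print(f"test MSE with augmentation:    {mse_aug:.4f}")
```

Keeping the ratios `d / n` and `p / n` fixed while increasing `n` mimics the proportional regime. Which of the two errors is smaller depends on `aug_std`, `ridge`, and these ratios, so the sketch is best used to sweep those values rather than to confirm a fixed outcome.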
Problem

Research questions and friction points this paper is trying to address.

generalization error
random feature regression
data augmentation
proportional regime
regularization effect
Innovation

Methods, ideas, or system contributions that make the work stand out.

random feature regression
data augmentation
generalization error
proportional regime
mean squared error