🤖 AI Summary
This work investigates how data augmentation affects the variance and asymptotic distribution of estimators, and shows that its benefits are not universal: augmentation can increase the uncertainty of estimates such as the empirical prediction risk, can fail to act as a regularizer in certain high-dimensional problems, and can shift the peak of the double-descent curve. Its impact depends on the interplay among the data distribution, the properties of the estimator, the sample size, the number of augmentations, and the dimension. To analyze these effects, the authors adapt Lindeberg's technique to block-dependent data, combining it with random matrix theory and asymptotic statistical inference to obtain a general analytical framework whose universality regime may be Gaussian or non-Gaussian. The framework quantifies augmentation effects through simple surrogates, explains several counterintuitive empirical phenomena, and is applied in detail to specific models, including ridge regression and minimum-norm interpolation.
📝 Abstract
We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties attributed to data augmentation are neither simply true nor simply false, but depend on a combination of factors, notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg's technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.
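To make the block-dependence structure concrete, the following is a minimal simulation sketch, not the paper's method or experimental setup. It assumes a hypothetical noise-injection augmentation (`augment`), a closed-form ridge estimator (`ridge_fit`), and a Monte Carlo helper (`empirical_risk_spread`) that reports the mean and standard deviation of the training risk on the augmented data. All names, the linear data model, and the parameter values are illustrative assumptions; the risk computed here may differ from the empirical prediction risk as defined in the paper.

```python
import numpy as np


def augment(X, y, k, sigma, rng):
    """Return k noisy copies of each row of X (hypothetical augmentation)."""
    n, d = X.shape
    # Rows i*k .. i*k + k - 1 all derive from sample i, so they form one
    # dependent block; different blocks are independent of each other.
    X_aug = np.repeat(X, k, axis=0) + sigma * rng.standard_normal((n * k, d))
    y_aug = np.repeat(y, k)
    return X_aug, y_aug


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X^T X / m + lam * I)^{-1} X^T y / m."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)


def empirical_risk_spread(n, d, k, sigma, lam, trials, seed=0):
    """Monte Carlo mean and standard deviation of the augmented training risk."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)  # assumed ground-truth coefficients
    risks = np.empty(trials)
    for t in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + 0.5 * rng.standard_normal(n)  # assumed linear model
        X_aug, y_aug = augment(X, y, k, sigma, rng)
        b = ridge_fit(X_aug, y_aug, lam)
        risks[t] = np.mean((y_aug - X_aug @ b) ** 2)
    return risks.mean(), risks.std(ddof=1)


if __name__ == "__main__":
    # k=1 with sigma=0 reproduces the unaugmented data, giving a baseline.
    for k, sigma in [(1, 0.0), (5, 0.5), (20, 0.5)]:
        mean, std = empirical_risk_spread(
            n=50, d=40, k=k, sigma=sigma, lam=0.1, trials=200
        )
        print(f"k={k:2d} sigma={sigma:.1f} mean={mean:.4f} std={std:.4f}")
```

Varying `n`, `d`, `k`, and `sigma` in such a sketch is one way to explore how the spread of the risk changes with sample size, dimension, and number of augmentations; the direction of the change is not fixed in advance, which is the point the abstract makes.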