Gaussian and Non-Gaussian Universality of Data Augmentation

📅 2022-02-18

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

This work systematically investigates how data augmentation affects the variance and asymptotic distribution of estimators, and shows that its benefits are not universal: in high-dimensional regimes, augmentation can increase the uncertainty of the empirical prediction risk, fail to act as a regularizer, and even shift the peak of the double-descent curve. The effect depends on the interplay among the data distribution, the properties of the estimator, the sample size, the number of augmentations, and the dimension. As their main theoretical tool, the authors adapt Lindeberg's technique to block dependence, integrating random matrix theory with asymptotic statistical inference to construct a general analytical framework whose universality regime may be Gaussian or non-Gaussian. The summary credits this framework with the first rigorous quantification of augmentation effects; it explains several counterintuitive empirical phenomena, and its predictions are validated on canonical models including ridge regression and minimum-norm interpolation.

📝 Abstract

We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties data augmentation has been attributed with are not either true or false, but rather depend on a combination of factors -- notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg's technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.
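The question of whether augmentation raises or lowers the uncertainty of an empirical prediction risk can be probed numerically on a toy problem. The following is a minimal sketch, not the paper's setup: the Gaussian design, the additive-noise augmentation, the held-out squared error used as a stand-in for the empirical risk, and all names and parameter values (`ridge_fit`, `augment`, `risk_variance`, `lam`, `sigma`) are assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X / m + lam * I) b = X^T y / m."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)


def augment(X, y, k, sigma, rng):
    """Hypothetical augmentation: k noisy copies of every row.

    With k == 0 the data are returned unchanged (no augmentation).
    The k copies of one row share the same label, so they form a
    dependent block.
    """
    if k == 0:
        return X, y
    X_aug = np.repeat(X, k, axis=0)  # each row repeated k times in a row
    X_aug = X_aug + sigma * rng.standard_normal(X_aug.shape)
    return X_aug, np.repeat(y, k)


def risk_variance(n, d, k, lam=0.1, sigma=0.5, reps=200, seed=0):
    """Monte Carlo mean and variance of a held-out squared-error risk."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)  # assumed ground-truth coefficients
    risks = np.empty(reps)
    for r in range(reps):
        X = rng.standard_normal((n, d))
        y = X @ beta + 0.5 * rng.standard_normal(n)
        X_aug, y_aug = augment(X, y, k, sigma, rng)
        b = ridge_fit(X_aug, y_aug, lam)
        # Fresh sample of the same size to evaluate the prediction risk.
        X_test = rng.standard_normal((n, d))
        y_test = X_test @ beta + 0.5 * rng.standard_normal(n)
        risks[r] = np.mean((y_test - X_test @ b) ** 2)
    return risks.mean(), risks.var(ddof=1)


# Compare no augmentation (k = 0) with five augmented copies per point.
for k in (0, 5):
    mean_risk, var_risk = risk_variance(n=50, d=20, k=k)
    print(f"k={k}: mean risk={mean_risk:.4f}, variance={var_risk:.6f}")
```

Varying `n`, `d`, and `k` in this sketch is one way to explore the interplay of sample size, dimension, and number of augmentations that the abstract refers to; the direction of the effect in any given setting should be read from the simulation rather than assumed.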

Problem

Research questions and friction points this paper is trying to address.

Quantify data augmentation's impact on estimate variance and distribution.

Analyze data augmentation's role as a regularizer in high-dimensional problems.

Determine conditions affecting data augmentation's influence on empirical risk.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted Lindeberg's technique for block dependence.

Analyzed Gaussian and non-Gaussian universality regimes.

Explored data augmentation's impact on estimate uncertainty.
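For orientation, the classical Lindeberg replacement argument for independent variables can be written as a telescoping sum. The notation here is assumed for illustration and is not taken from the paper: $f$ is the statistic, $h$ a smooth test function, $X_1,\dots,X_n$ the data, and $Z_1,\dots,Z_n$ independent surrogates. The paper's block-dependent version is not reproduced.

```latex
\[
\mathbb{E}\,h\bigl(f(X_1,\dots,X_n)\bigr) - \mathbb{E}\,h\bigl(f(Z_1,\dots,Z_n)\bigr)
= \sum_{i=1}^{n} \Bigl(
    \mathbb{E}\,h\bigl(f(Z_1,\dots,Z_{i-1},X_i,X_{i+1},\dots,X_n)\bigr)
  - \mathbb{E}\,h\bigl(f(Z_1,\dots,Z_{i-1},Z_i,X_{i+1},\dots,X_n)\bigr)
\Bigr)
\]
```

Each summand swaps a single variable and is typically bounded through a Taylor expansion when $X_i$ and $Z_i$ share their first two moments. Under block dependence, a natural reading is that all augmented copies of one data point are swapped together as a block; that reading is an assumption made here, not a statement of the paper's construction.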

🔎 Similar Papers

Data augmentation with automated machine learning: approaches and performance comparison with classical data augmentation methods