AI Summary
This work addresses the unclear theoretical mechanism by which data augmentation improves generalization through invariance to label-irrelevant transformations. The authors propose an information-theoretic framework that models augmentation as a mixture of the original and transformed distributions, deriving a mutual information-based generalization bound. This bound is innovatively decomposed into fidelity, stability, and sensitivity components, revealing an inherent trade-off between generalization and invariance learning under augmentation. To unify the geometric characterization of augmentations, they introduce the notion of "group diameter" and integrate orbit-averaged loss, sub-Gaussian assumptions, and geometric metrics under group actions into their analysis. Empirical results demonstrate that the proposed bound effectively tracks and predicts the true generalization gap, confirming the reliability and practical utility of the theoretical framework.
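For context, the classical mutual-information generalization bounds this framework builds on (in the Xu–Raginsky style, for a σ-sub-Gaussian loss) take roughly the following shape; this is shown as standard background, not the paper's own decomposed bound:

```latex
\[
\left| \,\mathbb{E}\!\left[ L_\mu(W) - L_S(W) \right] \right|
\;\le\;
\sqrt{\frac{2\sigma^2}{n}\, I(W; S)},
\]
% W: the hypothesis output by the learning algorithm
% S: the training sample of size n
% L_mu, L_S: population and empirical risks
% I(W; S): mutual information between hypothesis and training data
```

The abstract's bound extends this form by replacing the empirical loss with an orbit-averaged loss over augmentations and splitting the right-hand side into fidelity, stability, and sensitivity terms.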
Abstract
Data augmentation is one of the most widely used techniques to improve generalization in modern machine learning, often justified by its ability to promote invariance to label-irrelevant transformations. However, its theoretical role remains only partially understood. In this work, we propose an information-theoretic framework that systematically accounts for the effect of augmentation on generalization and invariance learning. Our approach builds upon mutual information-based bounds, which relate the generalization gap to the amount of information a learning algorithm retains about its training data. We extend this framework by modeling the augmented distribution as a composition of the original data distribution with a distribution over transformations, which naturally induces an orbit-averaged loss function. Under mild sub-Gaussian assumptions on the loss function and the augmentation process, we derive a new generalization bound that decomposes the expected generalization gap into three interpretable terms: (1) a distributional divergence between the original and augmented data, (2) a stability term measuring the algorithm's dependence on the training data, and (3) a sensitivity term capturing the effect of augmentation variability. To connect our bounds to the geometry of the augmentation group, we introduce the notion of group diameter, defined as the maximal perturbation that augmentations can induce in the input space. The group diameter provides a unified control parameter that bounds all three terms and highlights an intrinsic trade-off: small diameters preserve data fidelity but offer limited regularization, while large diameters enhance stability at the cost of increased bias and sensitivity. We validate our theoretical bounds with numerical experiments, demonstrating that they reliably track and predict the behavior of the true generalization gap.
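The two central constructions in the abstract, the orbit-averaged loss and the group diameter, can be made concrete in a toy setting. The sketch below is an illustrative assumption, not the paper's experimental setup: it uses a finite "group" of bounded additive shifts on a linear-regression input, a plain squared loss, and the diameter as the largest input-space displacement any transformation induces.

```python
import numpy as np

# Hypothetical sketch of the abstract's constructions. The names
# (squared_loss, orbit_averaged_loss, group_diameter) and the toy
# shift-based augmentation set are assumptions for illustration only.

rng = np.random.default_rng(0)

def squared_loss(w, x, y):
    """Per-example squared loss of a linear predictor w on (x, y)."""
    return (x @ w - y) ** 2

def orbit_averaged_loss(w, x, y, transforms):
    """Average the loss over the orbit {g(x) : g in G} of the input,
    mirroring the orbit-averaged loss induced by augmentation."""
    return np.mean([squared_loss(w, g(x), y) for g in transforms])

def group_diameter(x, transforms):
    """Maximal perturbation the augmentations induce in input space,
    i.e. max_g ||g(x) - x||."""
    return max(np.linalg.norm(g(x) - x) for g in transforms)

# A small augmentation set: additive shifts of bounded magnitude eps
# (including the identity, d = 0).
eps = 0.1
transforms = [lambda x, d=d: x + d for d in (-eps, 0.0, eps)]

w = rng.normal(size=3)
x = rng.normal(size=3)
y = 1.0

plain = squared_loss(w, x, y)
averaged = orbit_averaged_loss(w, x, y, transforms)
diam = group_diameter(x, transforms)
```

Shrinking `eps` shrinks the diameter, and the orbit-averaged loss collapses back to the plain loss, which is the small-diameter end of the fidelity/regularization trade-off the abstract describes.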