🤖 AI Summary
Background: It remains unclear whether synthetic data generated by generative models can improve classifier generalization, and existing heuristic selection criteria lack theoretical foundations.
Method: We analyze how synthetic data selection affects the generalization error from a high-dimensional regression perspective, identifying covariance shift (not mean shift) as the key limiting factor. Based on this insight, we propose the "covariance matching" principle and prove its optimality in certain settings. The framework extends to deep neural networks and mainstream generative models.
Contribution/Results: Through theoretical analysis of linear models and extensive empirical validation across diverse architectures, datasets, and generative models (e.g., GANs, VAEs, diffusion models), we demonstrate that covariance matching consistently outperforms existing synthetic data selection strategies, yielding stable and significant improvements in classifier prediction performance.
📝 Abstract
Despite the progress in the development of generative models, their usefulness in creating synthetic data that improve the prediction performance of classifiers has been called into question. Beyond heuristic principles such as "synthetic data should be close to the real data distribution", it is not clear which specific properties of synthetic data affect the generalization error. Our paper addresses this question through the lens of high-dimensional regression. Theoretically, we show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error but, surprisingly, the mean shift does not. Furthermore, we prove that, in some settings, matching the covariance of the target distribution is optimal. Remarkably, these theoretical insights from linear models carry over to deep neural networks and generative models. We empirically demonstrate that the covariance matching procedure (matching the covariance of the synthetic data to that of the data drawn from the target distribution) performs well against several recent approaches for synthetic data selection, across training paradigms, architectures, datasets, and generative models used for augmentation.
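To make the idea concrete, here is one way a covariance-matching selection rule could be sketched in numpy. This is an illustrative assumption, not the paper's actual procedure: the function name, the greedy scheme, and the Frobenius-norm criterion are all my choices for exposition. Centering at the real-data mean (rather than each pool's own mean) reflects the abstract's finding that mean shift does not affect the generalization error.

```python
import numpy as np

def select_by_covariance_matching(synthetic, real, k):
    """Greedily pick k synthetic points whose second moments, centered at the
    real-data mean, best match the real-data covariance.
    (Illustrative sketch only; not the paper's algorithm.)"""
    mu = real.mean(axis=0)               # center at the real mean: per the theory,
    target = np.cov(real, rowvar=False)  # only the covariance shift matters
    centered = synthetic - mu
    outers = np.einsum("ni,nj->nij", centered, centered)  # per-sample outer products
    chosen, running = [], np.zeros_like(target)
    for _ in range(k):
        best_i, best_gap = None, np.inf
        for i in range(len(synthetic)):
            if i in chosen:
                continue
            # covariance estimate if sample i were added to the current selection
            cand = (running * len(chosen) + outers[i]) / (len(chosen) + 1)
            gap = np.linalg.norm(cand - target)  # Frobenius distance to the target
            if gap < best_gap:
                best_i, best_gap = i, gap
        running = (running * len(chosen) + outers[best_i]) / (len(chosen) + 1)
        chosen.append(best_i)
    return chosen

# Toy check: a pool with the right covariance vs. one with an inflated covariance.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 2))
matched = rng.normal(size=(200, 2))         # same covariance as the real data
inflated = 5.0 * rng.normal(size=(200, 2))  # 25x the real covariance
synthetic = np.vstack([matched, inflated])
idx = select_by_covariance_matching(synthetic, real, k=20)
frac_matched = sum(i < 200 for i in idx) / len(idx)  # share picked from the matched pool
```

Under this criterion the selection should be dominated by the matched pool, since adding an inflated sample moves the running covariance far from the target in Frobenius norm. The paper's actual method may score or weight samples differently.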