AugGen: Synthetic Augmentation Can Improve Discriminative Models

📅 2025-03-14
🤖 AI Summary
Labeled data is scarce in privacy-sensitive scenarios, and existing synthetic data methods typically depend on external datasets or pre-trained models. To address both issues, this paper proposes a fully self-contained synthetic augmentation paradigm. The approach trains a conditional generative model exclusively on the target dataset and strategically samples synthetic instances from it, requiring no external data or pre-trained weights. The synthetic samples then augment the real data used to train the discriminative model. Evaluated on the IJB-C and IJB-B face recognition benchmarks, the method improves performance by 1–12% over real-data-only baselines and outperforms state-of-the-art synthetic data generation baselines. Notably, these gains often surpass those obtained through network architecture improvements. The work offers a privacy-conscious way to improve model performance when data is scarce and auxiliary data sources are unavailable.

📝 Abstract
The increasing dependence on large-scale datasets in machine learning introduces significant privacy and ethical challenges. Synthetic data generation offers a promising solution; however, most current methods rely on external datasets or pre-trained models, which add complexity and escalate resource demands. In this work, we introduce a novel self-contained synthetic augmentation technique that strategically samples from a conditional generative model trained exclusively on the target dataset. This approach eliminates the need for auxiliary data sources. Applied to face recognition datasets, our method achieves 1–12% performance improvements on the IJB-C and IJB-B benchmarks. It outperforms models trained solely on real data and exceeds the performance of state-of-the-art synthetic data generation baselines. Notably, these enhancements often surpass those achieved through architectural improvements, underscoring the significant impact of synthetic augmentation in data-scarce environments. These findings demonstrate that carefully integrated synthetic data not only addresses privacy and resource constraints but also substantially boosts model performance.
Project page: https://parsa-ra.github.io/auggen
Problem

Research questions and friction points this paper is trying to address.

Reliance on large-scale datasets raises privacy and ethical challenges.
Most synthetic data generation methods depend on external datasets or pre-trained models.
Model performance is limited in data-scarce environments.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-contained synthetic augmentation technique
Conditional generative model on target dataset
Eliminates need for auxiliary data sources
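
The self-contained loop described above (fit a conditional generator on the target dataset only, sample from it, then train on real plus synthetic data) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the per-class Gaussian is a toy stand-in for the paper's conditional generative model, the sampling is plain random draws rather than the paper's strategic sampling, and the names `fit_class_conditional`, `sample_synthetic`, and `augment` are hypothetical.

```python
import random
import statistics


def fit_class_conditional(samples):
    """Fit a per-class, per-dimension Gaussian using ONLY the target dataset.

    `samples` is a list of (feature_vector, label) pairs.
    This is a toy stand-in for a conditional generative model.
    """
    by_class = {}
    for x, label in samples:
        by_class.setdefault(label, []).append(x)

    params = {}
    for label, rows in by_class.items():
        dims = list(zip(*rows))  # transpose: one tuple per feature dimension
        means = [statistics.fmean(d) for d in dims]
        stds = [statistics.pstdev(d) for d in dims]
        params[label] = (means, stds)
    return params


def sample_synthetic(params, n_per_class, rng):
    """Draw n_per_class synthetic (feature_vector, label) pairs per class."""
    synthetic = []
    for label, (means, stds) in params.items():
        for _ in range(n_per_class):
            x = [rng.gauss(m, s) for m, s in zip(means, stds)]
            synthetic.append((x, label))
    return synthetic


def augment(real, n_per_class, seed=0):
    """Return the real samples followed by synthetic ones; no external data is used."""
    rng = random.Random(seed)
    params = fit_class_conditional(real)
    return real + sample_synthetic(params, n_per_class, rng)


# Toy target dataset with two classes (made-up numbers, not from the paper).
real = [
    ([0.0, 0.1], "a"),
    ([0.2, -0.1], "a"),
    ([4.0, 4.1], "b"),
    ([3.8, 4.3], "b"),
]
augmented = augment(real, n_per_class=3)
print(len(augmented))  # 4 real + 2 classes * 3 synthetic = 10
```

The augmented list would then be passed to whatever discriminative model is being trained. The point of the sketch is only the data flow: the generator sees nothing beyond the target dataset.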