🤖 AI Summary
This work addresses a common failure of synthetic data selection, the addition of uninformative or redundant samples, by proposing LiBaGS, a lightweight, generator-agnostic filtering method. The method scores candidates by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are informative and likely to lie on the real data manifold. It adds a boundary-gap allocation rule and a marginal-value stopping criterion, together with softer labels near ambiguous boundaries and a diversity objective, to prioritize sparse but realistic regions near decision boundaries. Experiments show that the method improves downstream accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
📝 Abstract
Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
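The abstract does not give the scoring formulas, so the following is only a minimal sketch of how such a selection could look, not the LiBaGS implementation. Everything in it is an assumption made for illustration: the function names `score_candidates` and `select_diverse`, the equal weighting of the three terms, the top-2 probability margin as a proxy for boundary proximity, the k-nearest-neighbor distance as a proxy for real-data sparsity, the radius `tau` as a support-validity gate, and the score threshold `min_score` as a stand-in for the marginal-value stopping rule.

```python
import numpy as np

def score_candidates(probs, cand, real, k=3, tau=1.0):
    """Score synthetic candidates (higher is better). All formulas are illustrative assumptions."""
    # Boundary proximity: a small gap between the two largest class probabilities.
    top2 = np.sort(probs, axis=1)[:, -2:]
    boundary = 1.0 - (top2[:, 1] - top2[:, 0])
    # Predictive uncertainty: entropy normalized to [0, 1] by log(num_classes).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1) / np.log(probs.shape[1])
    # Pairwise Euclidean distances from each candidate to each real sample.
    dists = np.linalg.norm(cand[:, None, :] - real[None, :, :], axis=2)
    # Sparsity of real data around the candidate: mean distance to the k nearest
    # real samples, squashed into [0, 1).
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    sparsity = knn_mean / (knn_mean + tau)
    # Support validity: reject candidates with no real sample within radius tau.
    valid = (dists.min(axis=1) <= tau).astype(float)
    return valid * (boundary + entropy + sparsity) / 3.0

def select_diverse(scores, cand, min_score=0.3, min_dist=0.5):
    """Greedy selection in descending score order with a near-duplicate filter."""
    chosen = []
    for i in np.argsort(-scores):
        # Crude stand-in for a stopping rule: stop once scores fall below a threshold.
        if scores[i] < min_score:
            break
        # Diversity: skip candidates too close to an already selected one.
        if all(np.linalg.norm(cand[i] - cand[j]) >= min_dist for j in chosen):
            chosen.append(int(i))
    return chosen
```

Here `probs` would hold the class probabilities that a classifier trained on the real data assigns to each candidate, and `cand` and `real` would be feature or embedding arrays of shape `(n, d)`. A candidate far from every real sample scores zero regardless of how uncertain the classifier is, which reflects the abstract's point that selected samples should be both informative and realistic.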