LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a common failure of synthetic data selection, where added samples turn out to be uninformative or redundant, by proposing a lightweight, generator-agnostic filtering method. The approach scores candidates by proximity to the decision boundary, predictive uncertainty, real-data density, and support validity, so that selected samples are informative and stay close to the real data manifold. It adds a boundary-gap allocation rule and a marginal-value stopping criterion, combined with soft labels and a diversity objective, to prioritize sparse but realistic regions near classification boundaries. Experiments reported in the paper show accuracy improvements over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
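
To make the four scoring signals concrete, here is a minimal sketch of one way they could be combined. It is not the paper's formula: the function name `boundary_gap_scores`, the top-two probability margin as a boundary proxy, the k-nearest-neighbor distance as a density proxy, the exponential validity gate, and the weights are all assumptions chosen for illustration. The sketch reads "real-data density" as favoring regions where real points are sparse, in line with the stated goal of targeting sparse but realistic neighborhoods.

```python
import numpy as np

def boundary_gap_scores(probs, cand, real, k=3, weights=(1.0, 1.0, 1.0)):
    """Score synthetic candidates; higher means more worth adding.

    probs: (n, C) class probabilities from any classifier for each candidate.
    cand:  (n, d) candidate synthetic features.
    real:  (m, d) real training features, with m > k.
    All proxies below are illustrative assumptions, not the paper's definitions.
    """
    eps = 1e-12

    # Boundary proximity: a small gap between the top-two class probabilities.
    top2 = np.sort(probs, axis=1)[:, -2:]
    boundary = 1.0 - (top2[:, 1] - top2[:, 0])

    # Predictive uncertainty: entropy normalised to [0, 1].
    entropy = -(probs * np.log(probs + eps)).sum(axis=1) / np.log(probs.shape[1])

    # Typical real-to-real neighbour distance (column 0 is the self-distance).
    rr = ((real[:, None, :] - real[None, :, :]) ** 2).sum(axis=2)
    scale = np.median(np.sqrt(np.sort(rr, axis=1)[:, 1:k + 1])) + eps

    # Distances from each candidate to its k nearest real points.
    cr = ((cand[:, None, :] - real[None, :, :]) ** 2).sum(axis=2)
    knn = np.sqrt(np.sort(cr, axis=1)[:, :k])

    # Sparsity: a large mean k-NN distance means few real points nearby.
    sparsity = 1.0 - np.exp(-knn.mean(axis=1) / scale)

    # Support validity: the nearest real point must still be close.
    validity = np.exp(-knn[:, 0] / scale)

    w_b, w_u, w_s = weights
    # Validity gates the informativeness terms so off-support candidates score near 0.
    return validity * (w_b * boundary + w_u * entropy + w_s * sparsity)
```

Using validity as a multiplicative gate rather than a fourth additive term is a design choice of this sketch: it keeps a highly uncertain but unrealistic candidate from outranking a realistic one.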

📝 Abstract

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
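
The selection side of the abstract (diversity, a stopping rule, and softer labels near ambiguous boundaries) can be sketched as a greedy loop. This is a hedged illustration under stated assumptions, not the authors' procedure: `select_with_stopping` and `soften_labels` are hypothetical names, the Gaussian similarity penalty stands in for the diversity objective, and a fixed `min_gain` threshold stands in for the marginal-value stopping rule, which the abstract describes as learned.

```python
import numpy as np

def select_with_stopping(scores, cand, min_gain=0.1, diversity=1.0, bandwidth=1.0):
    """Greedily pick candidates until the best remaining gain drops below min_gain.

    scores: (n,) per-candidate usefulness scores from any scoring function.
    cand:   (n, d) candidate synthetic features.
    """
    chosen = []
    # Similarity of each candidate to its most similar already-chosen sample.
    max_sim = np.zeros(len(scores))
    available = np.ones(len(scores), dtype=bool)
    while available.any():
        gain = scores - diversity * max_sim   # penalise near-duplicates
        gain[~available] = -np.inf
        best = int(np.argmax(gain))
        if gain[best] < min_gain:             # fixed-threshold stopping (assumption)
            break
        chosen.append(best)
        available[best] = False
        dist2 = ((cand - cand[best]) ** 2).sum(axis=1)
        max_sim = np.maximum(max_sim, np.exp(-dist2 / (2 * bandwidth ** 2)))
    return chosen

def soften_labels(probs, hard_labels, max_smooth=0.5):
    """Blend one-hot labels toward model probabilities, more so near boundaries."""
    one_hot = np.eye(probs.shape[1])[hard_labels]
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    alpha = max_smooth * (1.0 - margin)       # ambiguous sample -> larger alpha
    return (1 - alpha)[:, None] * one_hot + alpha[:, None] * probs
```

In this sketch a near-duplicate of an already selected sample loses almost all of its gain, so the loop moves on to a different region or stops, instead of adding more samples with the same content.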

Problem

Research questions and friction points this paper is trying to address.

synthetic data selection

decision boundary

training distribution

targeted generation

data augmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data selection

decision-boundary proximity

boundary-gap allocation