🤖 AI Summary
To address the challenge of extremely scarce positive labels in rare-event classification, which severely limits model performance, this paper proposes SYNAPSE-G. The framework first uses large language models (LLMs) to generate high-quality synthetic positive examples as seeds; it then discovers and expands positive examples via semi-supervised label propagation over a similarity graph connecting the seeds to a large unlabeled dataset. A theoretical analysis characterizes how the quality of the synthetic data induces a trade-off between precision and recall. Candidate positives surfaced by propagation are verified by a human or LLM oracle, enabling reliable positive identification even under cold-start conditions. Experiments on imbalanced benchmarks, including SST-2 and MHS, show that SYNAPSE-G significantly outperforms baselines such as k-nearest neighbors and pseudo-labeling, notably improving recall and F1-score for rare events.
📝 Abstract
Scarcity of labeled data, especially for rare events, hinders the training of effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline that leverages Large Language Models (LLMs) to generate synthetic training data for rare-event classification, addressing the cold-start problem. The synthetic examples serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. This identifies candidate positive examples, which are subsequently labeled by an oracle (human or LLM). The expanded dataset then trains or fine-tunes a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G's effectiveness at finding positive labels, outperforming baselines including nearest neighbor search.
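To make the seed-expansion step concrete, below is a minimal sketch of label propagation from synthetic positive seeds over a k-nearest-neighbor similarity graph. It is not the paper's exact algorithm: the random Gaussian vectors standing in for text embeddings, the neighborhood size `k`, the damping factor `alpha`, and the iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for text embeddings (illustrative, not real data):
# a few LLM-generated positive seeds plus a larger unlabeled pool
# containing some hidden positives and many negatives.
seeds = rng.normal(loc=1.0, size=(5, 16))         # synthetic positive seeds
unlabeled = np.vstack([
    rng.normal(loc=1.0, size=(10, 16)),           # hidden positives
    rng.normal(loc=-1.0, size=(90, 16)),          # negatives
])

X = np.vstack([seeds, unlabeled])
n_seed = len(seeds)

# Cosine-similarity graph, sparsified to each node's k nearest neighbors.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T
np.fill_diagonal(S, 0.0)
k = 10
thresh = np.sort(S, axis=1)[:, -k][:, None]
W = np.where(S >= thresh, np.clip(S, 0.0, None), 0.0)
W = np.maximum(W, W.T)                            # symmetrize the graph

# Iterative propagation: f <- alpha * P f + (1 - alpha) * y, where P is the
# row-normalized adjacency and y clamps the seeds to label 1.
P = W / W.sum(axis=1, keepdims=True)
y = np.zeros(len(X))
y[:n_seed] = 1.0
f = y.copy()
alpha = 0.8
for _ in range(50):
    f = alpha * (P @ f) + (1 - alpha) * y

# Rank unlabeled points by propagated score; the top candidates would be
# sent to the oracle (human or LLM) for labeling.
scores = f[n_seed:]
candidates = np.argsort(scores)[::-1][:10]
```

Because the two clusters are well separated here, the top-ranked candidates are the hidden positives; in practice the precision/recall of this candidate list depends on the validity and diversity of the synthetic seeds, which is the trade-off the paper analyzes.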