🤖 AI Summary
To address the challenge of extremely scarce positive labels in rare-event classification, which severely limits model performance, this paper proposes SYNAPSE-G. The framework first uses large language models (LLMs) to generate high-quality synthetic positive examples as seeds; it then discovers and expands positive examples via semi-supervised label propagation over a similarity graph connecting the seeds to a large unlabeled dataset. A theoretical analysis characterizes how the quality of the synthetic data induces a trade-off between precision and recall. Candidate positives surfaced by propagation are verified by a human or LLM oracle, enabling reliable positive identification even under cold-start conditions. Experiments on imbalanced benchmarks, including SST-2 and MHS, show that SYNAPSE-G significantly outperforms baselines such as k-nearest neighbors and pseudo-labeling, notably improving recall and F1-score for rare events.
📝 Abstract
Scarcity of labeled data, especially for rare events, hinders the training of effective machine learning models. This paper proposes SYNAPSE-G (Synthetic Augmentation for Positive Sampling via Expansion on Graphs), a novel pipeline that leverages Large Language Models (LLMs) to generate synthetic training data for rare-event classification, addressing the cold-start problem. The synthetic examples serve as seeds for semi-supervised label propagation on a similarity graph constructed between the seeds and a large unlabeled dataset. This identifies candidate positive examples, which are subsequently labeled by an oracle (human or LLM). The expanded dataset then trains or fine-tunes a classifier. We theoretically analyze how the quality (validity and diversity) of the synthetic data impacts the precision and recall of our method. Experiments on the imbalanced SST2 and MHS datasets demonstrate SYNAPSE-G's effectiveness at finding positive labels, outperforming baselines including nearest neighbor search.
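To make the seed-expansion step concrete, below is a minimal sketch of label propagation from synthetic positive seeds over a k-nearest-neighbor similarity graph. It is not the paper's exact algorithm: the random Gaussian vectors standing in for text embeddings, the neighborhood size `k`, the damping factor `alpha`, and the iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for text embeddings (illustrative, not real data):
# a few LLM-generated positive seeds plus a larger unlabeled pool
# containing some hidden positives and many negatives.
seeds = rng.normal(loc=1.0, size=(5, 16))         # synthetic positive seeds
unlabeled = np.vstack([
    rng.normal(loc=1.0, size=(10, 16)),           # hidden positives
    rng.normal(loc=-1.0, size=(90, 16)),          # negatives
])

X = np.vstack([seeds, unlabeled])
n_seed = len(seeds)

# Cosine-similarity graph, sparsified to each node's k nearest neighbors.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T
np.fill_diagonal(S, 0.0)
k = 10
thresh = np.sort(S, axis=1)[:, -k][:, None]
W = np.where(S >= thresh, np.clip(S, 0.0, None), 0.0)
W = np.maximum(W, W.T)                            # symmetrize the graph

# Iterative propagation: f <- alpha * P f + (1 - alpha) * y, where P is the
# row-normalized adjacency and y clamps the seeds to label 1.
P = W / W.sum(axis=1, keepdims=True)
y = np.zeros(len(X))
y[:n_seed] = 1.0
f = y.copy()
alpha = 0.8
for _ in range(50):
    f = alpha * (P @ f) + (1 - alpha) * y

# Rank unlabeled points by propagated score; the top candidates would be
# sent to the oracle (human or LLM) for labeling.
scores = f[n_seed:]
candidates = np.argsort(scores)[::-1][:10]
```

Because the two clusters are well separated here, the top-ranked candidates are the hidden positives; in practice the precision/recall of this candidate list depends on the validity and diversity of the synthetic seeds, which is the trade-off the paper analyzes.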