Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation

๐Ÿ“… 2025-01-20
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Clinical text classification suffers from scarcity of high-quality labeled data, heavy reliance on expert annotation, and dependence on large volumes of real-world samples. Method: We propose an embedding-space diversity-driven few-shot synthetic data generation method. Leveraging a pre-trained language model, we construct few-shot prompts from a small set of real medical reports (CheXpert) and employ cosine similarityโ€“based diversity sampling over clinical note embeddings to guide a large language model in generating syntactically well-formed and clinically aligned synthetic reports. Contribution/Results: This is the first work to integrate embedding-space diversity sampling with few-shot synthesis, substantially improving clinical fidelity and task utility of synthetic data. Experiments show the synthetic data achieves 90% efficacy relative to real data, with AUROC and AUPRC improvements of 57% and 68%, respectively. Moreover, only 60โ€“70% of real samples are needed to match baseline performance. Turing test evaluations confirm significantly superior clinical plausibility compared to random or zero-shot baselines.

Technology Category

Application Category

๐Ÿ“ Abstract
Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
Problem

Research questions and friction points this paper is trying to address.

Medical Report Synthesis
Data Augmentation
Language Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diversity Sampling
Data Embedding
Synthetic Data Generation
๐Ÿ”Ž Similar Papers
No similar papers found.
Ivan Lopez
Ivan Lopez
Stanford University
data sciencemachine learningNLPhealth systemsclinical decision support
F
Fateme Nateghi Haredasht
Center for Biomedical Informatics Research, Stanford, CA, USA
K
Kaitlin Caoili
The Ohio State University College of Medicine, Columbus, OH, USA
Jonathan H Chen
Jonathan H Chen
Stanford Department of Medicine
Medical Data ScienceInternal MedicineAIMachine LearningClinical Decision Support
A
Akshay Chaudhari
Department of Radiology, Stanford, CA, USA; Cardiovascular Institute, Stanford, CA, USA