AI Summary
Clinical text classification suffers from scarcity of high-quality labeled data, heavy reliance on expert annotation, and dependence on large volumes of real-world samples.
Method: We propose an embedding-space diversity-driven few-shot synthetic data generation method. Leveraging a pre-trained language model, we construct few-shot prompts from a small set of real medical reports (CheXpert) and employ cosine similarity-based diversity sampling over clinical note embeddings to guide a large language model in generating syntactically well-formed and clinically aligned synthetic reports.
Contribution/Results: This is the first work to integrate embedding-space diversity sampling with few-shot synthesis, substantially improving clinical fidelity and task utility of synthetic data. Experiments show the synthetic data achieves 90% efficacy relative to real data, with AUROC and AUPRC improvements of 57% and 68%, respectively. Moreover, only 60-70% of real samples are needed to match baseline performance. Turing test evaluations confirm significantly superior clinical plausibility compared to random or zero-shot baselines.
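The core selection step described above, choosing few-shot exemplars whose embeddings are mutually dissimilar under cosine similarity, can be sketched as a greedy farthest-point procedure. This is a minimal illustration, not the authors' exact implementation; the function names, the greedy strategy, and the seed choice are all assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diversity_sample(embeddings, k, seed_idx=0):
    """Greedily pick k note indices whose embeddings are mutually diverse.

    Hypothetical sketch: start from a seed note, then repeatedly add the
    candidate whose maximum cosine similarity to the already-selected set
    is lowest (i.e. the candidate least redundant with current picks).
    """
    selected = [seed_idx]
    candidates = set(range(len(embeddings))) - {seed_idx}
    while len(selected) < k and candidates:
        best = min(
            candidates,
            key=lambda i: max(
                cosine_sim(embeddings[i], embeddings[j]) for j in selected
            ),
        )
        selected.append(best)
        candidates.remove(best)
    return selected
```

The selected indices would then point at the real reports used to build the few-shot prompt; other diversity criteria (e.g. clustering-based sampling) would slot into the same interface.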
Abstract
Accurate classification of clinical text often requires fine-tuning pre-trained language models, a process that is costly and time-consuming due to the need for high-quality data and expert annotators. Synthetic data generation offers an alternative, though pre-trained models may not capture the syntactic diversity of clinical notes. We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting, generating synthetic text that better reflects clinical syntax. We evaluated this method using the CheXpert dataset on a classification task, comparing it to random few-shot and zero-shot approaches. Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text. Our pipeline reduced the data needed to reach the 0.85 AUC cutoff by 40% for AUROC and 30% for AUPRC, while augmenting models with synthetic data improved AUROC by 57% and AUPRC by 68%. Additionally, our synthetic data was 0.9 times as effective as real data, a 60% improvement in value.
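The abstract's similarity-based evaluation, checking how closely synthetic notes align with real clinical text, can be approximated by averaging cosine similarity between the two sets of embeddings. This is a hedged sketch of one plausible metric, not the paper's exact evaluation protocol; names are illustrative.

```python
import numpy as np

def mean_cross_cosine(real_embs, synth_embs):
    """Mean cosine similarity over all (synthetic, real) embedding pairs.

    Illustrative metric: L2-normalize each row, then average the full
    cross-set dot-product matrix. Higher values suggest the synthetic
    notes sit closer to real notes in embedding space.
    """
    r = real_embs / np.linalg.norm(real_embs, axis=1, keepdims=True)
    s = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    return float((s @ r.T).mean())
```

In practice, such a score would be compared across generation strategies (diversity-sampled few-shot vs. random few-shot vs. zero-shot) on the same embedding model, complementing the human Turing-test judgments.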