Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

📅 2026-04-29

🤖 AI Summary

This study addresses the scarcity of high-quality annotated clinical data and the privacy regulations that restrict data sharing, both of which hinder medical machine learning. The authors propose a conditional generation approach that uses three large language models (DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5) to synthesize mental health assessment reports conditioned on ICD-10 codes. They also introduce a multidimensional evaluation framework for clinical data augmentation that jointly assesses semantic fidelity, lexical diversity, and privacy preservation in the generated texts. The reported results indicate that the synthesized reports remain clinically coherent and diverse while avoiding memorization-related privacy risks, expanding the pool of training data available for clinical natural language processing tasks.
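As a rough illustration of what conditioning generation on an ICD-10 code can look like, the sketch below builds a text prompt from a code and its label. The prompt wording, the `build_prompt` helper, and the two-entry label table are assumptions made for illustration; the summary does not describe the authors' actual prompts or code set.

```python
# Hypothetical lookup table; the two labels follow the WHO ICD-10 classification.
ICD10_LABELS = {
    "F32.1": "Moderate depressive episode",
    "F41.1": "Generalized anxiety disorder",
}


def build_prompt(code: str, labels: dict = ICD10_LABELS) -> str:
    """Return a generation prompt conditioned on one ICD-10 code.

    The wording is an assumed template, not the prompt used in the paper.
    """
    if code not in labels:
        raise KeyError(f"Unknown ICD-10 code: {code}")
    return (
        "Write a fictional mental health evaluation report for a patient "
        f"whose diagnosis is ICD-10 {code} ({labels[code]}). "
        "Do not include names, dates, or other identifying details."
    )


prompt = build_prompt("F32.1")
```

The resulting string would then be passed to whichever model is being evaluated; that step is left out because it depends on the serving setup.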

📝 Abstract

The scarcity of high-quality annotated medical data, particularly in mental health, is a significant bottleneck for training robust machine learning models. Privacy regulations restrict data sharing, which makes synthetic data generation a promising alternative, and Large Language Models (LLMs) can serve as the generator in such a data augmentation pipeline. In the proposed methodology, DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5 generate synthetic mental health evaluation reports conditioned on specific International Classification of Diseases, Tenth Revision (ICD-10) codes. Because naive text generation can lead to mode collapse or privacy breaches through memorization, a comprehensive evaluation framework is introduced. The generated diagnostic texts are assessed along three dimensions: semantic fidelity, lexical diversity, and privacy/plagiarism. The results demonstrate that all three models can generate clinically coherent, diverse, and privacy-safe synthetic reports, significantly expanding the training data available for clinical natural language processing tasks without compromising patient confidentiality.
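The abstract names the three evaluation dimensions but not the metrics behind them. The sketch below shows one common proxy for each, using only the standard library: bag-of-words cosine similarity as a crude stand-in for semantic fidelity (embedding-based similarity is the more usual choice), distinct-n for lexical diversity, and maximum n-gram overlap with reference texts as a memorization check. All three function names and the choice of metrics are assumptions, not the paper's framework.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word-count vectors (crude fidelity proxy)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def distinct_n(texts, n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts."""
    grams = [g for t in texts for g in ngrams(t.lower().split(), n)]
    return len(set(grams)) / len(grams) if grams else 0.0


def max_ngram_overlap(candidate: str, references, n: int = 5) -> float:
    """Largest share of the candidate's n-grams copied from one reference."""
    cand = set(ngrams(candidate.lower().split(), n))
    if not cand:
        return 0.0
    return max(
        (len(cand & set(ngrams(r.lower().split(), n))) / len(cand)
         for r in references),
        default=0.0,
    )
```

Under this reading, a distinct-n value that falls as more reports are generated would point to mode collapse, and a high overlap score against the real reports would flag possible memorization.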

Problem

Research questions and friction points this paper is trying to address.

data scarcity

privacy regulations

clinical data augmentation

mental health

synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation

large language models

clinical data augmentation