Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

📅 2026-04-29

🤖 AI Summary
This study addresses the scarcity of high-quality, clinically annotated data and the constraints imposed by privacy regulations, both of which hinder progress in medical machine learning. The authors propose a conditional generation approach that uses three large language models (DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5) to synthesize mental health assessment reports conditioned on ICD-10 codes. They also introduce a multidimensional evaluation framework for clinical data augmentation that jointly assesses semantic fidelity, lexical diversity, and privacy preservation in the generated texts. According to the reported results, the synthesized reports remain clinically plausible while mitigating privacy risks, expanding the pool of training data available for clinical natural language processing tasks.
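As a rough illustration of what conditioning generation on an ICD-10 code could look like, the sketch below builds a text prompt from a code and its label. It is not the authors' prompt: the function name `build_prompt`, the small `ICD10_LABELS` table, and the prompt wording are all assumptions made for this example, and the call to an actual model is left out.

```python
# Hypothetical sketch: the prompt wording and helper names are assumptions,
# not taken from the paper.

# Two real ICD-10 codes with their standard labels, used only as examples.
ICD10_LABELS = {
    "F32.1": "Moderate depressive episode",
    "F41.1": "Generalized anxiety disorder",
}


def build_prompt(code: str) -> str:
    """Return a generation prompt conditioned on a single ICD-10 code."""
    label = ICD10_LABELS[code]  # raises KeyError for codes not in the table
    return (
        "Write a fictional mental health assessment report for a patient "
        f"whose primary diagnosis is ICD-10 {code} ({label}). "
        "Do not include names, dates, or other identifying details."
    )


if __name__ == "__main__":
    print(build_prompt("F32.1"))
```

The resulting string would then be passed to whichever model is being evaluated; how the paper structures its prompts is not stated in this summary.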
📝 Abstract
The scarcity of high-quality annotated medical data, particularly in mental health, is a significant bottleneck for training robust machine learning models. Privacy regulations restrict data sharing, which makes synthetic data generation a promising alternative, and Large Language Models (LLMs) can serve as the generator in such a data augmentation pipeline. In the proposed methodology, DeepSeek-R1, OpenBioLLM-Llama3, and Qwen 3.5 are used to generate synthetic mental health evaluation reports conditioned on specific International Classification of Diseases, Tenth Revision (ICD-10) codes. Because naive text generation can lead to mode collapse or to privacy breaches through memorization, a comprehensive evaluation framework is introduced. The generated diagnostic texts are assessed along three dimensions: semantic fidelity, lexical diversity, and privacy/plagiarism. The results indicate that all three models can generate clinically coherent, diverse, and privacy-safe synthetic reports, expanding the available training data for clinical natural language processing tasks without compromising patient confidentiality.
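Two of the three evaluation dimensions named above can be approximated with simple n-gram statistics. The sketch below is a minimal illustration under stated assumptions, not the metrics used in the paper: `distinct_n` is a common lexical-diversity proxy (share of unique n-grams across a set of texts, where low values hint at mode collapse), and `max_ngram_overlap` is a crude memorization check (the largest share of a generated text's n-grams found verbatim in any reference text). Function names, whitespace tokenization, and the choice of n are assumptions. Semantic fidelity is omitted because it would normally require an embedding model.

```python
# Minimal sketch of n-gram based diversity and overlap checks.
# These are generic proxies, not the paper's exact metrics.


def ngrams(text, n):
    """Lowercase, whitespace-tokenize, and return the list of n-gram tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (0.0 if none)."""
    all_ngrams = [g for t in texts for g in ngrams(t, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


def max_ngram_overlap(generated, references, n=5):
    """Largest share of the generated text's unique n-grams that also
    appear in a single reference text (0.0 if there is nothing to compare)."""
    gen = set(ngrams(generated, n))
    if not gen:
        return 0.0
    return max(
        (len(gen & set(ngrams(ref, n))) / len(gen) for ref in references),
        default=0.0,
    )


if __name__ == "__main__":
    synthetic = [
        "the patient reports low mood and poor sleep",
        "the patient describes persistent worry and muscle tension",
    ]
    real = ["the patient reports low mood and fatigue"]
    print(distinct_n(synthetic, n=2))
    print(max_ngram_overlap(synthetic[0], real, n=3))
```

A high overlap score flags a generated report for manual review; a threshold would have to be chosen empirically, since none is given here.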
Problem

Research questions and friction points this paper is trying to address.

data scarcity
privacy regulations
clinical data augmentation
mental health
synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic data generation
large language models
clinical data augmentation
privacy-preserving evaluation
semantic fidelity
Guillermo Iglesias
Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain
Gema Bello-Orgaz
Departamento de Sistemas Informáticos, Escuela Técnica Superior de Ingeniería de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain
María Navas-Loro
Profesor Ayudante Doctor (Assistant Professor)
Artificial Intelligence, Natural Language Processing, Knowledge Representation, Ethics
Cristian Ramirez-Atencia
Computer Systems Department, Universidad Politécnica de Madrid
Multi-Objective Optimization, Evolutionary Algorithms, Multiple Criteria Decision Making, Constraint
Mercè Salvador Robert
Department of Psychiatry, University Hospital Rey Juan Carlos, Mostoles, Spain
Enrique Baca-Garcia
Department of Psychiatry, University Hospital Jimenez Diaz Foundation, Madrid, Spain