AI-generated data contamination erodes pathological variability and diagnostic reliability

📅 2026-01-19
🏛️ medRxiv
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the risks of training-data contamination introduced by generative AI in medical domains, which can diminish pathological diversity and compromise diagnostic reliability. Analyzing more than 800,000 synthetic clinical texts, multimodal medical images, and vision–language reports, the work reveals that, in the absence of human verification, a self-referential feedback loop drives model outputs to converge toward generic phenotypes, obscuring critical pathological features. This leads to the disappearance of rare conditions such as pneumothorax, false reassurance rates tripling to 40%, and clinical documentation becoming unusable after two generations of synthetic-data reuse. The research further identifies a pronounced decoupling between diagnostic confidence and accuracy, and proposes an effective mitigation strategy combining quality-aware filtering with curated real-data mixing, establishing a new paradigm for preserving the integrity of medical data ecosystems.

📝 Abstract
Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop in which future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-to-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.
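The generational collapse and the real-data-mixing mitigation described in the abstract can be sketched as a toy simulation. Everything below is illustrative: the prevalence numbers, the power-law "sharpening" used as a stand-in for mode collapse, and the 50% real-data fraction are assumptions for the sketch, not the paper's actual experimental method; the quality-aware filtering component is omitted for brevity.

```python
# Hypothetical baseline prevalence of chest findings in real reports
# (illustrative numbers, not taken from the paper).
REAL = {"no_finding": 0.60, "effusion": 0.25, "pneumothorax": 0.15}

def train_generation(dist, sharpen=2.0):
    """One synthetic generation: the model over-represents common
    phenotypes, modeled here as raising each probability to a power
    and renormalizing (a toy stand-in for mode collapse)."""
    raw = {k: v ** sharpen for k, v in dist.items()}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}

def mix_with_real(dist, real=REAL, real_frac=0.5):
    """Mitigation sketch: blend curated real data back into the
    training pool at a fixed fraction each generation."""
    return {k: real_frac * real[k] + (1 - real_frac) * dist[k] for k in dist}

def run(generations, mitigate):
    dist = dict(REAL)
    for _ in range(generations):
        dist = train_generation(dist)
        if mitigate:
            dist = mix_with_real(dist)
    return dist

unmitigated = run(5, mitigate=False)
mitigated = run(5, mitigate=True)
```

After five unmitigated generations the rare `pneumothorax` class is driven to effectively zero prevalence, while the mitigated run settles near a fixed point that keeps the rare class represented, mirroring the abstract's finding that real-data mixing preserves diversity where pure synthetic scaling does not.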
Problem

Research questions and friction points this paper is trying to address.

AI-generated data contamination
pathological variability
diagnostic reliability
synthetic medical data
healthcare data ecosystems
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-generated data contamination
pathological variability
diagnostic reliability
synthetic data feedback loop
quality-aware filtering
Hongyu He
National University of Singapore, Singapore
Shaowen Xiang
National University of Singapore, Singapore
Ye Zhang
National University of Singapore, Singapore
Yingtao Zhu
National University of Singapore, Singapore
Jin Zhang
National University of Singapore, Singapore
Hao Deng
Engineer
recommendation systems
Emily Alsentzer
Assistant Professor, Stanford University
machine learning for healthcare
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining; Machine learning; Data curation; BioNLP; Medical Imaging Analysis
Kun-Hsing Yu
Harvard Medical School
Andrew Marshall
Harvard University, MA, USA; Google, CA, USA
Tingting Chen
National University of Singapore
Machine Learning; Computer Vision
Srinivas Anumasa
Postdoctoral Researcher
Machine learning; Diffusion; Spiking neural networks; Neural ODE
Daniel Ebner
Mayo Clinic, MN, USA
Dean Ho
National University of Singapore, Singapore
K. Ngiam
National University of Singapore, Singapore
Ching-Yu Cheng
National University of Singapore, Singapore
Dianbo Liu
Assistant Professor, National University of Singapore
Push the limits of human and machine learning in biomedical sciences