AI-generated data contamination erodes pathological variability and diagnostic reliability

📅 2026-01-19
🏛️ medRxiv
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the risks of training-data contamination introduced by generative AI in medical domains, which can diminish pathological diversity and compromise diagnostic reliability. Analyzing more than 800,000 synthetic clinical texts, multimodal medical images, and vision–language reports, the work reveals that, in the absence of human verification, a self-referential feedback loop drives model outputs to converge toward generic phenotypes, obscuring critical pathological features. This leads to the disappearance of rare conditions such as pneumothorax, false reassurance rates tripling to 40%, and clinical documentation becoming unusable after two generations of synthetic-data reuse. The research further identifies a pronounced decoupling between diagnostic confidence and accuracy, and proposes an effective mitigation strategy combining quality-aware filtering with curated real-data mixing, establishing a new paradigm for preserving the integrity of medical data ecosystems.

📝 Abstract
Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop in which future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-to-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.
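The generational collapse and the real-data-mixing mitigation described in the abstract can be sketched as a toy simulation. Everything below is illustrative: the prevalence numbers, the power-law "sharpening" used as a stand-in for mode collapse, and the 50% real-data fraction are assumptions for the sketch, not the paper's actual experimental method; the quality-aware filtering component is omitted for brevity.

```python
# Hypothetical baseline prevalence of chest findings in real reports
# (illustrative numbers, not taken from the paper).
REAL = {"no_finding": 0.60, "effusion": 0.25, "pneumothorax": 0.15}

def train_generation(dist, sharpen=2.0):
    """One synthetic generation: the model over-represents common
    phenotypes, modeled here as raising each probability to a power
    and renormalizing (a toy stand-in for mode collapse)."""
    raw = {k: v ** sharpen for k, v in dist.items()}
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}

def mix_with_real(dist, real=REAL, real_frac=0.5):
    """Mitigation sketch: blend curated real data back into the
    training pool at a fixed fraction each generation."""
    return {k: real_frac * real[k] + (1 - real_frac) * dist[k] for k in dist}

def run(generations, mitigate):
    dist = dict(REAL)
    for _ in range(generations):
        dist = train_generation(dist)
        if mitigate:
            dist = mix_with_real(dist)
    return dist

unmitigated = run(5, mitigate=False)
mitigated = run(5, mitigate=True)
```

After five unmitigated generations the rare `pneumothorax` class is driven to effectively zero prevalence, while the mitigated run settles near a fixed point that keeps the rare class represented, mirroring the abstract's finding that real-data mixing preserves diversity where pure synthetic scaling does not.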
Problem

Research questions and friction points this paper is trying to address.

AI-generated data contamination
pathological variability
diagnostic reliability
synthetic medical data
healthcare data ecosystems
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-generated data contamination
pathological variability
diagnostic reliability
synthetic data feedback loop
quality-aware filtering
Hongyu He
National University of Singapore, Singapore
Shaowen Xiang
National University of Singapore, Singapore
Ye Zhang
National University of Singapore, Singapore
Yingtao Zhu
National University of Singapore, Singapore
Jin Zhang
National University of Singapore, Singapore
Hao Deng
Engineer
recommendation systems
Emily Alsentzer
Assistant Professor, Stanford University
machine learning for healthcare
Qingyu Chen
Biomedical Informatics & Data Science, Yale University; NCBI-NLM, National Institutes of Health
Text mining; Machine learning; Data curation; BioNLP; Medical Imaging Analysis
Kun-Hsing Yu
Harvard Medical School
Andrew Marshall
Harvard University, MA, USA; Google, CA, USA
Tingting Chen
National University of Singapore
Machine Learning; Computer Vision
Srinivas Anumasa
Postdoctoral Researcher
Machine learning; Diffusion; Spiking neural networks; Neural ODE
Daniel Ebner
Mayo Clinic, MN, USA
Dean Ho
National University of Singapore, Singapore
K. Ngiam
National University of Singapore, Singapore
Ching-Yu Cheng
National University of Singapore, Singapore
Dianbo Liu
Assistant Professor, National University of Singapore
Push the limits of human and machine learning in biomedical sciences