Impact of Label Noise from Large Language Models Generated Annotations on Evaluation of Diagnostic Model Performance

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a systematic bias in the evaluation of downstream diagnostic models when reference labels are generated by large language models (LLMs), stemming from label noise inherent in LLM outputs. We propose a synthetic-data simulation framework that integrates disease prevalence, LLM sensitivity, and LLM specificity, coupled with Monte Carlo simulations and a theoretical error-bound derivation. Our analysis reveals, for the first time, that the resulting evaluation bias is strongly prevalence-dependent: specificity dominates the distortion at low prevalence, whereas sensitivity dominates at high prevalence. Theoretically, at 10% disease prevalence, even a modest drop in LLM specificity to 95% can erroneously downgrade a perfectly sensitive (100%) model to an estimated sensitivity of ~53%. Monte Carlo experiments confirm the robustness and uncertainty bounds of this downward bias. Based on these findings, we introduce a "prevalence-aware prompting" principle as a methodological foundation and practical guideline for trustworthy evaluation of diagnostic models trained or validated with LLM-generated annotations.
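The prevalence-dependent distortion described above can be illustrated with a closed-form back-of-envelope calculation. This is a sketch of the general mechanism, not the paper's exact derivation; it assumes LLM and model errors are conditionally independent given the true label, so exact magnitudes may differ from the published figures.

```python
def observed_sensitivity(prev, se_llm, sp_llm, se_model, sp_model):
    """Sensitivity a model *appears* to have when noisy LLM labels
    serve as the reference standard, assuming LLM and model errors
    are independent given the true disease status."""
    # Fraction of cases the LLM labels positive: true positives it
    # catches plus true negatives it mislabels.
    llm_pos = prev * se_llm + (1 - prev) * (1 - sp_llm)
    # Fraction where both the LLM reference and the model are positive.
    both_pos = (prev * se_llm * se_model
                + (1 - prev) * (1 - sp_llm) * (1 - sp_model))
    return both_pos / llm_pos

# Perfect model scored against a 95%-specific LLM at 10% prevalence:
print(observed_sensitivity(0.10, 1.0, 0.95, 1.0, 1.0))
# ~0.69 under this independence assumption: well below the
# model's true 100% sensitivity.
```

Raising the prevalence shrinks the pool of LLM false positives relative to true positives, which is why the same specificity gap distorts low-prevalence evaluations far more.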

📝 Abstract
Large language models (LLMs) are increasingly used to generate labels from radiology reports to enable large-scale AI evaluation. However, label noise from LLMs can introduce bias into performance estimates, especially under varying disease prevalence and model quality. This study quantifies how LLM labeling errors impact downstream diagnostic model evaluation.

We developed a simulation framework to assess how LLM label errors affect observed model performance. A synthetic dataset of 10,000 cases was generated across different prevalence levels. LLM sensitivity and specificity were varied independently between 90% and 100%. We simulated diagnostic models with true sensitivity and specificity ranging from 90% to 100%. Observed performance was computed using LLM-generated labels as the reference. We derived analytical performance bounds and ran 5,000 Monte Carlo trials per condition to estimate empirical uncertainty.

Observed performance was highly sensitive to LLM label quality, with bias strongly influenced by disease prevalence. In low-prevalence settings, small reductions in LLM specificity led to substantial underestimation of sensitivity. For example, at 10% prevalence, an LLM with 95% specificity yielded an observed sensitivity of ~53% despite a perfect model. In high-prevalence scenarios, reduced LLM sensitivity caused underestimation of model specificity. Monte Carlo simulations consistently revealed downward bias, with observed performance often falling below true values even when within theoretical bounds.

LLM-generated labels can introduce systematic, prevalence-dependent bias into model evaluation. Specificity is more critical in low-prevalence tasks, while sensitivity dominates in high-prevalence settings. These findings highlight the importance of prevalence-aware prompt design and error characterization when using LLMs for post-deployment model assessment in clinical AI.
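The abstract's simulation setup can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: it uses fewer trials than the paper's 5,000 for brevity and assumes LLM and model errors are independent given the true label.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(prev, se_llm, sp_llm, se_model, sp_model,
             n_cases=10_000, n_trials=200):
    """Monte Carlo sketch: draw true labels at a given prevalence,
    corrupt them with LLM error rates, and score a simulated model
    against the noisy LLM labels as if they were ground truth."""
    obs_sens = []
    for _ in range(n_trials):
        truth = rng.random(n_cases) < prev
        # LLM reference labels with imperfect sensitivity/specificity.
        llm = np.where(truth,
                       rng.random(n_cases) < se_llm,
                       rng.random(n_cases) > sp_llm)
        # Diagnostic model predictions, independent of the LLM's errors.
        model = np.where(truth,
                         rng.random(n_cases) < se_model,
                         rng.random(n_cases) > sp_model)
        # "Observed" sensitivity: model positives among LLM-positive cases.
        obs_sens.append(model[llm].mean())
    return float(np.mean(obs_sens))

# Perfect model evaluated against a 95%-specific LLM at 10% prevalence.
print(simulate(0.10, 1.0, 0.95, 1.0, 1.0))
```

Sweeping `prev`, `se_llm`, and `sp_llm` over the grids described in the abstract reproduces the qualitative pattern reported: observed sensitivity drops sharply as specificity falls in low-prevalence settings.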
Problem

Research questions and friction points this paper is trying to address.

Quantifies impact of LLM label errors on diagnostic model evaluation
Assesses bias in performance estimates from LLM-generated labels
Examines prevalence-dependent effects on model sensitivity and specificity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation framework assesses LLM label errors
Monte Carlo trials estimate empirical uncertainty
Prevalence-aware prompt design reduces bias
Mohammadreza Chavoshi
MD, Postdoctoral Researcher, Emory University
Radiology, Meta-analysis, Artificial Intelligence

Hari Trivedi
Emory University
Deep Learning, Radiology, Mammography, AI, Natural Language Processing

Janice M. Newsome
Department of Radiology, Emory University, Atlanta, GA, USA

Aawez Mansuri
Department of Radiology, Emory University, Atlanta, GA, USA

Chiratidzo Rudado Sanyika
Department of Radiology, Emory University, Atlanta, GA, USA

Rohan Isaac
Department of Radiology, Emory University, Atlanta, GA, USA

Frank Li
Department of Radiology, Emory University, Atlanta, GA, USA

T. Dapamede
Department of Radiology, Emory University, Atlanta, GA, USA

J. Gichoya
Department of Radiology, Emory University, Atlanta, GA, USA