VeriSim: A Configurable Framework for Evaluating Medical AI Under Realistic Patient Noise

📅 2026-04-11

🤖 AI Summary
This study addresses a gap in current medical AI evaluation benchmarks: they overlook the communication noise that arises in real clinical settings (patient memory lapses, limited health literacy, anxiety), so reported performance overstates real-world effectiveness. The authors propose VeriSim, a configurable patient simulation framework that models six noise dimensions derived from the medical communication literature. VeriSim uses a hybrid verification mechanism that combines UMLS with large language models to keep injected noise consistent with the medical ground truth. Across seven open-weight large language models, realistic noise reduces diagnostic accuracy by 15–25% and increases conversation length by 34–55%; smaller models (7B) degrade 40% more than larger models (70B+), and medical fine-tuning on standard corpora provides limited robustness benefits. Evaluation by board-certified clinicians shows high-quality simulation with strong inter-annotator agreement (κ > 0.80), and the results point to a substantial Sim-to-Real gap.

📝 Abstract
Medical large language models (LLMs) achieve impressive performance on standardized benchmarks, yet these evaluations fail to capture the complexity of real clinical encounters where patients exhibit memory gaps, limited health literacy, anxiety, and other communication barriers. We introduce VeriSim, a truth-preserving patient simulation framework that injects controllable, clinically evidence-grounded noise into patient responses while maintaining strict adherence to medical ground truth through a hybrid UMLS-LLM verification mechanism. Our framework operationalizes six noise dimensions derived from peer-reviewed medical communication literature, capturing authentic clinical phenomena such as patient recall limitations, health literacy barriers, and stigma-driven non-disclosure. Experiments across seven open-weight LLMs reveal that all models degrade significantly under realistic patient noise, with diagnostic accuracy dropping 15-25% and conversation length increasing 34-55%. Notably, smaller models (7B) show 40% greater degradation than larger models (70B+), while medical fine-tuning on standard corpora provides limited robustness benefits against patient communication noise. Evaluation by board-certified clinicians demonstrates high-quality simulation with strong inter-annotator agreement (kappa > 0.80), while LLM-as-a-Judge serves as a validated auxiliary evaluator achieving comparable reliability for scalable assessment. Our results highlight a critical Sim-to-Real gap in current medical AI. We release VeriSim as an open-source noise-injection framework, establishing a rigorous testbed for evaluating clinical robustness.
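
The abstract describes the mechanism only at a high level, so the following is a minimal sketch of truth-preserving noise injection under stated assumptions, not VeriSim's actual implementation. It models three of the noise phenomena the abstract names (recall limitations, health literacy barriers, non-disclosure). The names `NoiseConfig`, `inject_noise`, `is_truth_preserving`, and the `SURFACE_FORMS` table are hypothetical, and a plain dictionary lookup stands in for the hybrid UMLS-LLM verifier.

```python
import random
from dataclasses import dataclass

# Toy vocabulary standing in for a terminology resource such as UMLS.
# Hypothetical data: first form is clinical wording, last form is lay wording.
SURFACE_FORMS = {
    "chest pain": ["chest pain", "my chest hurts"],
    "dyspnea": ["dyspnea", "trouble breathing"],
    "fever": ["fever", "feeling hot"],
}


@dataclass(frozen=True)
class NoiseConfig:
    """Per-dimension probabilities in [0, 1]; the field names are assumptions."""
    recall_gap: float = 0.0      # patient hedges because of uncertain memory
    low_literacy: float = 0.0    # patient uses lay wording instead of clinical terms
    non_disclosure: float = 0.0  # patient withholds the fact entirely


def inject_noise(ground_truth, cfg, rng):
    """Turn ground-truth concepts into noisy patient replies.

    Noise may hide or blur a true fact, but never introduces a new one.
    """
    replies = []
    for concept in ground_truth:
        if rng.random() < cfg.non_disclosure:
            continue  # fact is withheld
        forms = SURFACE_FORMS[concept]
        phrase = forms[-1] if rng.random() < cfg.low_literacy else forms[0]
        if rng.random() < cfg.recall_gap:
            phrase = f"maybe {phrase}, I don't remember exactly"
        replies.append(phrase)
    return replies


def mentioned_concepts(text):
    """Map a reply back to the concepts whose surface forms it contains."""
    lowered = text.lower()
    return {
        concept
        for concept, forms in SURFACE_FORMS.items()
        if any(form in lowered for form in forms)
    }


def is_truth_preserving(replies, ground_truth):
    """Accept the replies only if they assert nothing outside the ground truth."""
    allowed = set(ground_truth)
    return all(mentioned_concepts(reply) <= allowed for reply in replies)


if __name__ == "__main__":
    truth = ["chest pain", "dyspnea"]
    cfg = NoiseConfig(recall_gap=0.5, low_literacy=0.5, non_disclosure=0.3)
    replies = inject_noise(truth, cfg, random.Random(0))
    print(replies, is_truth_preserving(replies, truth))
```

The design choice illustrated here is the asymmetry: noise is allowed to omit or soften true information, while the verifier rejects any reply that asserts a concept absent from the ground truth.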
Problem

Research questions and friction points this paper is trying to address.

medical AI
patient noise
clinical robustness
communication barriers
Sim-to-Real gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

patient simulation
realistic noise injection
medical LLM robustness
UMLS-LLM verification
Sim-to-Real gap
Sina Mansouri
Department of Computer Science, George Mason University
Mohit Marvania
Department of Computer Science, George Mason University
Vibhavari Ashok Shihorkar
Department of Health Administration and Policy, College of Public Health, George Mason University
Han Ngoc Tran
Department of Health Administration and Policy, College of Public Health, George Mason University
Kazhal Shafiei
Department of Science, George Mason University
Mehrdad Fazli
Department of Computer Science, George Mason University
Yikuan Li
Department of Health Administration and Policy, College of Public Health, George Mason University
Ziwei Zhu
Assistant Professor at George Mason University
data mining, information retrieval, machine learning, responsible AI