Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large language models (LLMs) perform strongly on static medical question-answering tasks, but their performance degrades in clinical diagnostic scenarios that require multi-turn evidence gathering. To address this gap, the work introduces a standardized patient simulator grounded in Objective Structured Clinical Examination (OSCE) principles and establishes a reproducible benchmark for interactive diagnostic reasoning. Building on prior interactive evaluation efforts, the study shows that static full-context evaluations may overestimate LLMs’ diagnostic proficiency in interactive settings, and it proposes interactive evaluation as a complement. The benchmark combines multi-turn dialogue with a two-dimensional assessment of diagnostic accuracy and evidence quality. Across 468 clinical cases and 15 models, multi-turn evidence collection reduces diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% relative to full-context evaluation.
📝 Abstract
Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.
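The contrast between full-context and interactive evaluation described above can be illustrated with a toy sketch. This is not the paper's simulator or scoring method: the `Case`, `PatientSimulator`, `RULES`, and `diagnose` names, the example findings, and the fixed question plan are all hypothetical stand-ins, and a rule lookup replaces the language model. It only shows how a limited question budget can produce premature diagnostic closure even when the same case is answered correctly with full context.

```python
from dataclasses import dataclass


@dataclass
class Case:
    diagnosis: str   # reference diagnosis for scoring
    findings: dict   # question topic -> finding revealed when asked


class PatientSimulator:
    """Hypothetical simulator: reveals a finding only when its topic is asked."""

    def __init__(self, case):
        self._case = case

    def ask(self, topic):
        return self._case.findings.get(topic)


# Hypothetical rule table standing in for a model's diagnostic reasoning.
RULES = {
    "pneumonia": {"fever", "productive cough", "lobar infiltrate"},
    "bronchitis": {"productive cough"},
}


def diagnose(evidence):
    """Return the most specific diagnosis whose required findings are all present."""
    best, best_size = "undetermined", 0
    for name, required in RULES.items():
        if required <= evidence and len(required) > best_size:
            best, best_size = name, len(required)
    return best


def run_full_context(case):
    """Static setting: every finding is visible up front."""
    return diagnose(set(case.findings.values()))


def run_interactive(case, question_plan, max_turns):
    """Interactive setting: evidence must be requested, one topic per turn."""
    patient = PatientSimulator(case)
    evidence = set()
    for topic in question_plan[:max_turns]:
        finding = patient.ask(topic)
        if finding is not None:
            evidence.add(finding)
    return diagnose(evidence)


case = Case(
    diagnosis="pneumonia",
    findings={
        "cough": "productive cough",
        "temperature": "fever",
        "imaging": "lobar infiltrate",
    },
)
plan = ["cough", "temperature", "imaging"]

print(run_full_context(case))                    # all findings available
print(run_interactive(case, plan, max_turns=1))  # stops after one question
print(run_interactive(case, plan, max_turns=3))  # gathers all three findings
```

With a budget of one question the toy diagnoser commits to the less specific answer, which mirrors the premature-closure failure mode the abstract reports; the real benchmark's dialogue protocol and evidence-quality scoring are not reproduced here.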
Problem

Research questions and friction points this paper is trying to address.

active evidence-seeking
diagnostic reasoning
clinical decision support
interactive evaluation
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

active evidence-seeking
diagnostic reasoning
interactive evaluation
standardized patient simulator
clinical decision support
Chen Zhan
Bioinformatician / Research Fellow, University of Adelaide
Bioinformatics, Data Mining, Pharmacoepidemiology, Artificial Intelligence
Xihe Qiu
Associate Professor, Shanghai University of Engineering Science
AI for Healthcare, Vision-Language Models, Reinforcement Learning, Large Language Models
Xiaoyu Tan
Tencent Youtu Lab, Shanghai, 200233, China.
Xibing Zhuang
Department of Oncology, Jinshan Hospital, Fudan University, Shanghai, 201508, China.
Gengchen Ma
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, 201620, China.
Yue Zhang
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, 201620, China.
Shuo Li
Fellow of SPIE, AIMBE, AAIA, IET, and IAMBE; Chair Professor, Case Western Reserve University
Artificial Intelligence, Vision-Language Model, Machine Learning, Medical Image Analysis
Peifeng Liu
State Key Laboratory of Systems Medicine for Cancer, Shanghai Cancer Institute, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200032, China.
Xiaoxiao Ge
Integrative Clinical Research Ward, Clinical Medicine Research Institute, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.
Liang Liu
Highly Cited Researcher, EEE Department, The Hong Kong Polytechnic University
convex optimization, MIMO, Internet of Things, integrated sensing and communication
Lu Gan
Department of Medical Oncology, Zhongshan Hospital, Fudan University, Shanghai, 200032, China.