🤖 AI Summary
Large language models (LLMs) perform strongly on static medical question-answering tasks but degrade noticeably in clinical diagnostic scenarios that require multi-turn evidence gathering. To address this gap, this work introduces a standardized patient simulator grounded in Objective Structured Clinical Examination (OSCE) principles and establishes a controlled, reproducible benchmark for interactive diagnostic reasoning. The evaluation combines multi-turn dialogue with a two-part assessment of diagnostic accuracy and supporting-evidence quality. Across 468 clinical cases and 15 models, multi-turn evidence collection lowers diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% relative to full-context evaluation, with error analyses linking the drops to premature diagnostic closure and inefficient questioning. The results suggest that static full-context evaluations may overestimate LLMs’ diagnostic proficiency in interactive settings, and the work proposes interactive evaluation as a complement to them.
📝 Abstract
Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.