AI Summary
Current large language models perform strongly in single-turn medical diagnosis, yet their behavior remains unclear in multi-turn settings, which more closely mirror real-world clinical reasoning through iterative evidence accumulation. To address this gap, this work introduces MINT, a multi-turn medical diagnosis benchmark that systematically evaluates eleven large language models using high-fidelity case decomposition, clinician-annotated evidence segments, and controlled turn structures. The study reveals three previously undocumented diagnostic patterns: premature answering, limited self-correction capability, and susceptibility to strong misleading cues. It also proposes actionable intervention strategies, showing that delaying the model's initial response can improve first-answer accuracy by up to 62.6%, while deferring the presentation of critical evidence avoids accuracy drops of up to 23.3%.
Abstract
Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information, such as laboratory results, triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
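To make the first two patterns concrete, the sketch below shows one way such statistics could be computed from per-turn answer logs. It is an illustration only, not the MINT evaluation code: the `Trace` record, the function names, and the exact-string match against the gold diagnosis are all assumptions, and the abstract does not specify how the revision "rate" is normalized, so only raw revision counts are reported here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Trace:
    """One model run on one case (hypothetical record format, not from MINT)."""
    gold: str
    # answers[t] is the diagnosis committed at turn t (0-indexed),
    # or None if the model chose to wait at that turn.
    answers: Sequence[Optional[str]]


def first_commit_turn(trace: Trace) -> Optional[int]:
    """Return the index of the first turn with a committed answer, if any."""
    for turn, answer in enumerate(trace.answers):
        if answer is not None:
            return turn
    return None


def early_commit_fraction(traces: Sequence[Trace], k: int = 2) -> float:
    """Share of answered cases whose first answer came within the first k turns."""
    firsts = [first_commit_turn(t) for t in traces]
    answered = [f for f in firsts if f is not None]
    if not answered:
        return 0.0
    return sum(f < k for f in answered) / len(answered)


def flip_counts(traces: Sequence[Trace]) -> Tuple[int, int]:
    """Count (incorrect-to-correct, correct-to-incorrect) revisions.

    A revision is a change between two consecutive committed answers.
    Correctness is an exact string match with the gold label (an assumption).
    """
    fixed = broke = 0
    for t in traces:
        committed = [a for a in t.answers if a is not None]
        for prev, cur in zip(committed, committed[1:]):
            if prev == cur:
                continue
            if prev != t.gold and cur == t.gold:
                fixed += 1
            elif prev == t.gold and cur != t.gold:
                broke += 1
    return fixed, broke
```

A ratio of the two counts returned by `flip_counts` would correspond loosely to the self-correction comparison above, although a faithful rate would also need the number of opportunities for each kind of revision.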