Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

πŸ“… 2026-04-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current large language models perform strongly on single-turn medical diagnosis, yet their behavior in multi-turn settings, which more closely mirror real-world clinical reasoning with iterative evidence accumulation, remains unclear. To address this gap, this work introduces MINT, a multi-turn medical diagnosis benchmark that systematically evaluates eleven large language models using high-fidelity case decomposition, clinician-annotated evidence segments, and controlled turn structures. The study reveals three previously undocumented diagnostic patterns: premature answering, limited self-correction, and susceptibility to strong misleading cues. It also proposes actionable interventions, showing that delaying the model's initial response improves first-answer accuracy by up to 62.6%, while deferring the presentation of critical evidence mitigates accuracy drops of up to 23.3%.
πŸ“ Abstract
Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
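The self-correction finding compares two transition rates across turns: incorrect-to-correct revisions versus correct-to-incorrect flips. A minimal sketch of how such flip counts could be tallied from per-turn correctness labels is shown below; the function name and input format are illustrative assumptions, not the paper's actual evaluation code.

```python
def flip_counts(correct_per_turn):
    """Count answer-revision flips for one case.

    correct_per_turn: list of bools, one per turn, where True means the
    model's committed answer at that turn was correct.
    Returns (incorrect_to_correct, correct_to_incorrect) flip counts.
    """
    i2c = c2i = 0
    for prev, curr in zip(correct_per_turn, correct_per_turn[1:]):
        if not prev and curr:
            i2c += 1  # model revised a wrong answer into the right one
        elif prev and not curr:
            c2i += 1  # model abandoned a correct answer
    return i2c, c2i

# Example: wrong for two turns, then self-corrects and holds.
print(flip_counts([False, False, True, True]))  # -> (1, 0)
```

Aggregating these counts over all cases and models would yield the ratio the abstract reports (up to 10.6x more incorrect-to-correct than correct-to-incorrect flips).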
Problem

Research questions and friction points this paper is trying to address.

multi-turn medical diagnosis
premature answering
self-correction
clinical reasoning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn medical diagnosis
premature answering
self-correction
clinical reasoning
LLM benchmarking
πŸ”Ž Similar Papers
No similar papers found.