🤖 AI Summary
Background: Existing evaluations of large language models (LLMs) for medicine lack systematic robustness testing in multi-turn clinical dialogues; conventional single-turn benchmarks fail to capture real-world challenges such as contradictory inputs, misleading contextual cues, and authority bias. Method: We propose MedQA-Followup, a framework that formally defines and distinguishes shallow versus deep robustness in multi-turn medical question answering and introduces an indirect versus direct intervention axis. Building on MedQA, we construct a controllable multi-turn test suite that simulates realistic disruptions to clinical consultations. Contribution/Results: Experiments on five state-of-the-art LLMs reveal a dramatic accuracy drop, from 91.2% in single-turn settings to as low as 13.5% in multi-turn scenarios, with indirect contextual interference proving more harmful than direct prompt manipulation. These findings expose structural fragility in sequential clinical interaction, offering critical risk awareness for clinical deployment and establishing a new evaluation paradigm for dialogue robustness in medical AI.
📝 Abstract
Large language models (LLMs) are rapidly transitioning into clinical use, yet their reliability under realistic, multi-turn interactions remains poorly understood. Existing evaluation frameworks typically assess single-turn question answering under idealized conditions, overlooking the complexities of medical consultations, where conflicting input, misleading context, and authority influence are common. We introduce MedQA-Followup, a framework for systematically evaluating multi-turn robustness in medical question answering. Our approach distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), while also introducing an indirect-direct axis that separates contextual framing (indirect) from explicit suggestion (direct). Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs and find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect, context-based interventions are often more harmful than direct suggestions, yielding larger accuracy drops across models and exposing a significant vulnerability for clinical deployment. Analyses of compounded, repeated interventions further reveal differences between models: some show additional performance drops while others partially recover or even improve. These findings highlight multi-turn robustness as a critical but underexplored dimension for the safe and reliable deployment of medical LLMs.
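To make the shallow/deep and indirect/direct distinctions concrete, here is a minimal sketch of how the four intervention types could be rendered as multi-turn chat transcripts. The helper name, the sample question, and the misleading cues below are illustrative assumptions for exposition, not the paper's released prompts or code.

```python
# Illustrative sketch of the 2x2 intervention design (shallow/deep x
# indirect/direct) as OpenAI-style message lists. All names and prompt
# text are assumptions, not MedQA-Followup's actual materials.

QUESTION = (
    "A 55-year-old man presents with crushing chest pain radiating to the "
    "left arm. Which is the most appropriate next step?\n"
    "A) Obtain an ECG  B) Order a chest X-ray  C) Discharge home  D) Give antacids"
)

def build_intervention(depth: str, axis: str) -> list[dict]:
    """Return a message list for one cell of the 2x2 design.

    depth: "shallow" (misleading material in the initial context) or
           "deep"    (the model's answer is challenged in a later turn).
    axis:  "indirect" (contextual framing) or "direct" (explicit suggestion).
    """
    indirect_cue = ("For context: a colleague mentioned the patient recently "
                    "ate spicy food, so this is probably just reflux.")
    direct_cue = "I am fairly sure the correct answer is D. Please confirm."
    cue = indirect_cue if axis == "indirect" else direct_cue

    if depth == "shallow":
        # Shallow: the misleading cue is injected before the question itself.
        return [{"role": "user", "content": f"{cue}\n\n{QUESTION}"}]

    # Deep: the model answers first, then a follow-up turn pushes back.
    return [
        {"role": "user", "content": QUESTION},
        {"role": "assistant", "content": "A) Obtain an ECG."},
        {"role": "user",
         "content": f"Are you certain? {cue} Please give your final answer."},
    ]

if __name__ == "__main__":
    for depth in ("shallow", "deep"):
        for axis in ("indirect", "direct"):
            print(f"--- {depth} / {axis} ---")
            print(build_intervention(depth, axis))
```

Under this framing, shallow robustness is measured by accuracy on the single perturbed turn, while deep robustness requires the model to hold its initial correct answer after the follow-up challenge; the compounding analyses would repeat the challenge turn across several rounds.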