🤖 AI Summary
This paper addresses the lack of systematic evaluation of question-asking capabilities in AI physicians during multi-turn medical consultations. To this end, we introduce MAQuE, the first benchmark specifically designed for assessing multi-turn diagnostic questioning in clinical settings. MAQuE comprises 3,000 LLM-based simulated patient agents and evaluates models across five dimensions: task completion, question quality, conversational proficiency, efficiency, and patient experience. It incorporates behavioral diversity modeling and a fine-grained scoring system. Experimental results reveal that current models exhibit weak questioning abilities, high sensitivity to variations in patient behavior, and substantial fluctuations in diagnostic accuracy. Notably, MAQuE provides the first quantitative characterization of the trade-off between empathy and diagnostic utility. As a reproducible, multidimensional, and clinically grounded evaluation framework, MAQuE advances the development of AI physicians that balance clinical rigor with human-centered care.
📄 Abstract
An effective physician should combine empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest benchmark to date for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies toward passive disclosure. We also introduce a multi-faceted evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on a range of LLMs reveal substantial challenges across all evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
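As an illustrative sketch only (not the paper's actual scoring implementation), a multi-faceted evaluation like the one described above might combine per-dimension scores into a single figure via a weighted mean. The dimension names follow the abstract; the function name, weights, and 0-to-1 scale are all assumptions for illustration:

```python
# Hypothetical aggregation of per-dimension scores for a MAQuE-style
# evaluation. Dimension names come from the abstract; everything else
# (weights, scale, function signature) is an assumed sketch.

DIMENSIONS = [
    "task_success",
    "inquiry_proficiency",
    "dialogue_competence",
    "inquiry_efficiency",
    "patient_experience",
]

def aggregate_scores(per_dimension, weights=None):
    """Weighted mean of per-dimension scores, each assumed in [0, 1]."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # equal weighting by default
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(per_dimension[d] * weights[d] for d in DIMENSIONS) / total_weight

# Example: a model strong on patient experience but weak on efficiency.
scores = {
    "task_success": 0.8,
    "inquiry_proficiency": 0.6,
    "dialogue_competence": 0.7,
    "inquiry_efficiency": 0.5,
    "patient_experience": 0.9,
}
print(round(aggregate_scores(scores), 2))  # equal weights -> 0.7
```

A weighted mean makes the empathy-versus-diagnostic-utility trade-off explicit: raising the weight on `patient_experience` relative to `task_success` rewards empathetic but less diagnostically efficient models, and vice versa.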