🤖 AI Summary
This paper addresses the lack of systematic evaluation of question-asking capabilities in AI physicians during multi-turn medical consultations. To this end, we introduce MAQuE, the first benchmark specifically designed for assessing multi-turn diagnostic questioning in clinical settings. MAQuE comprises 3,000 LLM-based simulated patient agents and evaluates models across five dimensions: task completion, question quality, conversational proficiency, efficiency, and patient experience. It incorporates behavioral diversity modeling and a fine-grained scoring system. Experimental results reveal that current models exhibit weak questioning abilities, high sensitivity to variations in patient behavior, and substantial fluctuations in diagnostic accuracy. Notably, MAQuE provides the first quantitative characterization of the trade-off between empathy and diagnostic utility. As a reproducible, multidimensional, and clinically grounded evaluation framework, MAQuE advances the development of AI physicians that balance clinical rigor with human-centered care.
📄 Abstract
An effective physician should combine empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest benchmark to date for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies toward passive disclosure. We also introduce a multi-faceted evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on a range of LLMs reveal substantial challenges across all evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
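As an illustrative sketch only (not the paper's actual scoring implementation), a multi-faceted evaluation like the one described above might combine per-dimension scores into a single figure via a weighted mean. The dimension names follow the abstract; the function name, weights, and 0-to-1 scale are all assumptions for illustration:

```python
# Hypothetical aggregation of per-dimension scores for a MAQuE-style
# evaluation. Dimension names come from the abstract; everything else
# (weights, scale, function signature) is an assumed sketch.

DIMENSIONS = [
    "task_success",
    "inquiry_proficiency",
    "dialogue_competence",
    "inquiry_efficiency",
    "patient_experience",
]

def aggregate_scores(per_dimension, weights=None):
    """Weighted mean of per-dimension scores, each assumed in [0, 1]."""
    if weights is None:
        weights = {d: 1.0 for d in DIMENSIONS}  # equal weighting by default
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(per_dimension[d] * weights[d] for d in DIMENSIONS) / total_weight

# Example: a model strong on patient experience but weak on efficiency.
scores = {
    "task_success": 0.8,
    "inquiry_proficiency": 0.6,
    "dialogue_competence": 0.7,
    "inquiry_efficiency": 0.5,
    "patient_experience": 0.9,
}
print(round(aggregate_scores(scores), 2))  # equal weights -> 0.7
```

A weighted mean makes the empathy-versus-diagnostic-utility trade-off explicit: raising the weight on `patient_experience` relative to `task_success` rewards empathetic but less diagnostically efficient models, and vice versa.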