AI Summary
This study addresses the limitations of current evaluations of medical consultation agents, which predominantly focus on final outcomes while neglecting the coherence of clinical reasoning, patient safety, and structured diagnostic inquiry. To bridge this gap, the work proposes the first comprehensive evaluation framework spanning the entire clinical workflow, from history taking and diagnosis to treatment planning and follow-up. It introduces atomic information units (AIUs) to track sub-turn information acquisition and establishes a process-aware assessment system comprising 22 fine-grained metrics. The framework further integrates a constraint-aware treatment plan revision mechanism and an interactive simulation environment. Systematic evaluation of 19 large language models reveals that high diagnostic accuracy often masks inefficiencies in history taking and potential medication safety risks, highlighting a significant disconnect between medical knowledge and practical clinical competence.
Abstract
Current evaluations of medical consultation agents often prioritize outcome-oriented tasks, frequently overlooking the end-to-end process integrity and clinical safety essential for real-world practice. While recent interactive benchmarks have introduced dynamic scenarios, they often remain fragmented and coarse-grained, failing to capture the structured inquiry logic and diagnostic rigor required in professional consultations. To bridge this gap, we propose MedConsultBench, a comprehensive framework designed to evaluate the complete online consultation cycle by covering the entire clinical workflow from history taking and diagnosis to treatment planning and follow-up Q&A. Our methodology introduces Atomic Information Units (AIUs) to track clinical information acquisition at a sub-turn level, enabling precise monitoring of how key facts are elicited through 22 fine-grained metrics. By addressing the underspecification and ambiguity inherent in online consultations, the benchmark evaluates uncertainty-aware yet concise inquiry while emphasizing medication regimen compatibility and the ability to handle realistic post-prescription follow-up Q&A via constraint-respecting plan revisions. Systematic evaluation of 19 large language models reveals that high diagnostic accuracy often masks significant deficiencies in information-gathering efficiency and medication safety. These results underscore a critical gap between theoretical medical knowledge and clinical practice ability, establishing MedConsultBench as a rigorous foundation for aligning medical AI with the nuanced requirements of real-world clinical care.
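To make the idea of sub-turn AIU tracking concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. The abstract does not list the 22 metrics, so the class `AIUTracker`, its fields, and the two scores `coverage` and `aius_per_turn` are hypothetical names chosen for illustration. The sketch assumes each case comes with a set of required AIUs and that some upstream step has already decided which AIUs a given turn revealed.

```python
from dataclasses import dataclass, field


@dataclass
class AIUTracker:
    """Hypothetical tracker for atomic information units (AIUs) in one consultation."""

    required: set[str]                                      # AIUs the case expects to be elicited
    elicited: dict[str, int] = field(default_factory=dict)  # AIU -> turn at which it first appeared
    turns: int = 0

    def record_turn(self, revealed: set[str]) -> set[str]:
        """Register one question/answer turn and return the AIUs that are new in it."""
        self.turns += 1
        new = {a for a in revealed if a in self.required and a not in self.elicited}
        for aiu in new:
            self.elicited[aiu] = self.turns
        return new

    def coverage(self) -> float:
        """Fraction of required AIUs elicited so far (assumed metric)."""
        return len(self.elicited) / len(self.required) if self.required else 1.0

    def aius_per_turn(self) -> float:
        """New AIUs gained per turn, a rough inquiry-efficiency score (assumed metric)."""
        return len(self.elicited) / self.turns if self.turns else 0.0


# Illustrative case: four required facts, three turns of history taking.
tracker = AIUTracker(required={"onset", "fever", "allergies", "current_meds"})
first = tracker.record_turn({"onset", "fever"})   # one turn can yield several AIUs
second = tracker.record_turn({"fever"})           # redundant question, nothing new
third = tracker.record_turn({"allergies"})
```

Counting at the AIU level rather than per turn is what lets a single turn score as highly informative, redundant, or empty, which is the kind of distinction a turn-level or outcome-only score would miss.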