🤖 AI Summary
Current large language models (LLMs) achieve strong performance on medical benchmark datasets but lack the strategic diagnostic questioning and empathetic communication required in real-world clinical settings. To address this, we propose an experience-driven multi-agent reinforcement learning framework that, for the first time, decouples and jointly optimizes clinical decision accuracy and empathetic dialogue proficiency. Our method establishes a multi-agent interactive environment with a dual-layer reward mechanism (clinical correctness + conversational quality) and integrates an experience replay buffer to enhance policy learning. The agent is trained on high-quality, expert-annotated clinical dialogue trajectories, combining LLMs, multi-agent systems, and experience replay techniques. Experiments demonstrate that our AI physician significantly outperforms leading open-source domain-specific models and multiple closed-source foundation models on HealthBench and MAQuE, while achieving superior parameter efficiency. Human evaluation by clinical experts further confirms a strong preference for its multi-turn, empathetic diagnostic dialogues.
📝 Abstract
The professionalism of a human doctor in outpatient service depends on two core abilities: making accurate medical decisions and conducting strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI's HealthBench and MAQuE, assessing multi-faceted metrics such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, human evaluations show a strong preference for the clinical dialogues Doctor-R1 generates, demonstrating the effectiveness of the framework.
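The two-tiered reward and experience repository described above can be sketched minimally as follows. This is an illustrative assumption of how the pieces might compose, not the paper's actual implementation: the weights `w_decision`/`w_dialogue`, the buffer capacity, and all function names are hypothetical.

```python
import random
from collections import deque

def two_tier_reward(decision_score, dialogue_score,
                    w_decision=0.7, w_dialogue=0.3):
    """Combine the clinical-correctness tier and the conversational-quality
    tier into a single scalar reward (weights are illustrative)."""
    return w_decision * decision_score + w_dialogue * dialogue_score

class ExperienceRepository:
    """Fixed-capacity store of prior dialogue trajectories, used to ground
    policy updates in high-quality past experience (a standard replay-buffer
    pattern; the paper's exact mechanism may differ)."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, trajectory, reward):
        """Store a (multi-turn dialogue, scalar reward) pair."""
        self.buffer.append((trajectory, reward))

    def sample(self, k=4):
        """Draw up to k past trajectories for a policy-learning step."""
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

# Example: score one consultation turn and bank the trajectory.
repo = ExperienceRepository(capacity=100)
r = two_tier_reward(decision_score=1.0, dialogue_score=0.0)  # 0.7
repo.add(trajectory=["Doctor: How long has the pain lasted?"], reward=r)
```

In this sketch the two tiers are kept as separate inputs so each can be optimized or ablated independently, mirroring the decoupling the abstract describes.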