MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing evaluation frameworks struggle to effectively assess large language models’ (LLMs’) capabilities in information gathering and diagnostic reasoning during multi-turn medical consultations. To address this gap, this work introduces a benchmark comprising 5,200 synthetic clinical cases and over 60,000 fine-grained scoring criteria, leveraging a multi-agent system to simulate authentic patient dialogues and integrating evidence-based medical guidelines to define “mandatory inquiry” dimensions. The study innovatively incorporates a clinician-validated LLM scoring pipeline, a dynamic hallucination detection and correction mechanism, and a high-quality dialogue generation strategy based on rejection sampling. Comprehensive evaluations of leading LLMs reveal significant shortcomings in current systems during multi-turn medical interactions, underscoring the urgent need for architectural improvements in dialogue management rather than reliance solely on base model fine-tuning.
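The summary's "high-quality dialogue generation strategy based on rejection sampling" can be pictured as a simple accept/reject loop: candidates are drawn from a generator and kept only if they pass quality validators. The sketch below is purely illustrative; `generate_dialogue` and `passes_checks` are hypothetical stand-ins for the paper's LLM generator and its hallucination/coherence checks, not actual APIs from the work.

```python
import random

def generate_dialogue(case, rng):
    # Stand-in generator: returns a candidate dialogue with a quality score.
    # In the paper this would be a multi-agent LLM simulation.
    quality = rng.random()
    return {"case": case, "quality": quality}

def passes_checks(dialogue, threshold=0.8):
    # Stand-in validator: accept only candidates above a quality bar,
    # mimicking hallucination/coherence filtering.
    return dialogue["quality"] >= threshold

def sample_dialogue(case, max_tries=100, seed=0):
    # Rejection sampling: keep drawing candidates until one is accepted
    # or the retry budget is exhausted.
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = generate_dialogue(case, rng)
        if passes_checks(candidate):
            return candidate
    return None  # all candidates rejected

dialogue = sample_dialogue("case-001")
```

The key property of the loop is that only candidates satisfying the validators ever enter the benchmark, at the cost of extra generation calls when the acceptance rate is low.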

📝 Abstract
Medical conversational AI plays a pivotal role in the development of safer and more effective medical dialogue systems. However, existing benchmarks and evaluation frameworks fall short of rigorously assessing the information-gathering and diagnostic reasoning abilities of medical large language models (LLMs). To address these gaps, we present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics generated by LLMs and subsequently refined by clinical experts, specifically designed to assess the multi-turn diagnostic capabilities of LLMs. Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints from underlying disease knowledge without accessing real-world electronic health records, thereby mitigating privacy and data-governance concerns. We design a robust Patient Agent that is limited to a set of atomic medical facts and augmented with a dynamic guidance mechanism that continuously detects and corrects hallucinations throughout the dialogue, ensuring internal coherence and clinical plausibility of the simulated cases. Furthermore, we propose a structured LLM-based and expert-annotated rubric-generation pipeline that retrieves Evidence-Based Medicine (EBM) guidelines and utilizes rejection sampling to derive a prioritized set of rubric items ("must-ask" items) for each case. We perform a comprehensive evaluation of state-of-the-art models and demonstrate that, across multiple assessment dimensions, current models face substantial challenges. Our results indicate that improving medical dialogue will require advances in dialogue management architectures, not just incremental tuning of the base model.
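The rubric-based evaluation described above (weighted "must-ask" items per case) can be sketched as a coverage score over a model's doctor turns. The example below is a minimal illustration only: it uses crude keyword matching as a stand-in for the paper's clinician-validated LLM scoring pipeline, and all names and rubric entries are invented for demonstration.

```python
def score_dialogue(doctor_turns, rubric):
    """Score a dialogue by the fraction of rubric weight it covers.

    rubric: list of (keyword, weight) pairs, where weight reflects the
    priority of each "must-ask" inquiry. Keyword matching stands in for
    LLM-based judging.
    """
    text = " ".join(doctor_turns).lower()
    covered = sum(w for kw, w in rubric if kw.lower() in text)
    total = sum(w for _, w in rubric)
    return covered / total if total else 0.0

# Invented example rubric for a chest-pain case, with higher weights on
# higher-priority inquiries.
rubric = [("chest pain onset", 3), ("radiation to arm", 2),
          ("smoking history", 1), ("family history", 1)]
turns = ["When did the chest pain onset occur?",
         "Do you have a smoking history?"]
score = score_dialogue(turns, rubric)  # 4/7 of rubric weight covered
```

Weighting the items lets the benchmark penalize a missed critical inquiry more heavily than a missed routine one, which is what distinguishes a prioritized rubric from a flat checklist.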
Problem

Research questions and friction points this paper is trying to address.

medical dialogue systems
large language models
diagnostic reasoning
evaluation benchmark
multi-turn consultation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent simulation
hallucination correction
evidence-based rubrics
privacy-preserving benchmark
dialogue management architecture
Lecheng Gong (Ant Group)
Weimin Fang (Ant Group)
Ting Yang (Ant Group)
Dongjie Tao (Ant Group)
Chunxiao Guo (Ant Group)
Peng Wei (Ant Group)
Bo Xie (Facebook; Machine Learning, Optimization, Deep Learning)
Jinqun Guan (Ant Group)
Zixiao Chen (Ant Group)
Fang Shi (Ant Group)
Jinjie Gu (Ant Group; Machine Learning, Recommendation)
Junwei Liu (Ant Group)