ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

📅 2026-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses a limitation of existing medical question-answering benchmarks, which predominantly focus on single-turn interactions and fail to capture the iterative clarification and follow-up inherent in real-world physician–patient dialogues. To bridge this gap, the authors construct the first multi-turn evaluation benchmark derived from authentic online medical consultations sourced from r/AskDocs, comprising 2,437 dialogue threads and 8,204 question–answer pairs. They further introduce two new metrics, the Conversational Consistency Score (CCS) and the Error Propagation Rate (EPR), to assess model performance across turns. Evaluating five leading large language models, including GPT-5 and Claude Haiku, with an LLM-as-a-judge approach, they find that even the best-performing model, GPT-5, produces fully correct responses only 41.2% of the time, with wrong-answer rates roughly tripling by the third turn, revealing significant reliability gaps in current models on multi-turn medical dialogue.

📝 Abstract
Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
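The paper's exact formulas for CCS and EPR are not reproduced on this page, so the sketch below illustrates one plausible reading of the abstract's description: CCS as within-thread consistency of per-turn judge grades, and EPR as the risk ratio of a wrong turn following a wrong turn versus following an acceptable one (matching the abstract's "1.9-6.1x" phrasing). The grade encoding, function names, and sample data are all assumptions for illustration, not the authors' definitions.

```python
from statistics import pstdev

# Each conversation thread is a list of per-turn judge grades, where
# 1.0 = fully correct, 0.5 = partially correct, 0.0 = completely wrong.
# (This encoding is an assumption; the paper's rubric may differ.)
threads = [
    [1.0, 1.0, 0.0],   # swings from fully correct to completely wrong
    [1.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

def conversational_consistency_score(thread):
    """Assumed CCS: 1 minus the population std. dev. of turn grades.
    A thread graded identically on every turn scores 1.0; one that
    oscillates between fully correct and completely wrong scores
    much lower."""
    return 1.0 - pstdev(thread)

def error_propagation_rate(threads, wrong=0.0):
    """Assumed EPR: P(wrong at t+1 | wrong at t) divided by
    P(wrong at t+1 | not wrong at t), i.e. how much one wrong turn
    multiplies the risk that the next turn is also wrong."""
    wrong_after_wrong = after_wrong = 0
    wrong_after_ok = after_ok = 0
    for thread in threads:
        for prev, nxt in zip(thread, thread[1:]):
            if prev == wrong:
                after_wrong += 1
                wrong_after_wrong += int(nxt == wrong)
            else:
                after_ok += 1
                wrong_after_ok += int(nxt == wrong)
    p_after_wrong = wrong_after_wrong / after_wrong if after_wrong else 0.0
    p_after_ok = wrong_after_ok / after_ok if after_ok else 0.0
    return p_after_wrong / p_after_ok if p_after_ok else float("inf")

for t in threads:
    print(f"CCS {t} -> {conversational_consistency_score(t):.2f}")
print(f"EPR -> {error_propagation_rate(threads):.1f}x")
```

A risk-ratio formulation is used here because it directly matches the abstract's claim that a wrong turn "raises the probability of a subsequent wrong turn by 1.9-6.1x"; if the paper instead defines EPR as a raw conditional probability, only the final division would change.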
Problem

Research questions and friction points this paper is trying to address.

medical question answering
multi-turn dialogue
patient-physician conversation
benchmark
conversational consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn medical dialogue
real-world patient questions
LLM reliability
Conversational Consistency Score
Error Propagation Rate
Monica Munnangi
Khoury College of Computer Sciences, Northeastern University, Boston
Saiph Savage
Northeastern University & UNAM
Human Centered AI, Gig Work, Digital Civics, Digital Labor Platforms, Crowdsourcing