Automatic Replication of LLM Mistakes in Medical Conversations

📅 2025-12-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of systematically evaluating medical errors in LLM-generated doctor–patient dialogues. We propose MedMistake, an automated framework that synthesizes realistic clinical dialogues via multi-LLM role-playing (patient, physician, and adjudicator) and introduces a multi-adjudicator committee mechanism for automated error identification and QA-pair distillation, transforming complex dialogue-level errors into standardized single-turn question–answer samples. We release MedMistake-Bench, the first clinically validated medical-error benchmark (211 items), and MedMistake-All, a large-scale dataset (3,390 items), both covering high-difficulty, high-clinical-relevance errors. A comprehensive evaluation across 12 state-of-the-art LLMs demonstrates MedMistake's ability to consistently reproduce and assess model-specific medical errors. The framework establishes a scalable, empirically verifiable paradigm for safety evaluation of medical LLMs.

📝 Abstract
Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics that quantify reasoning quality, safety, and patient-centeredness. Yet replicating specific mistakes in other LLMs is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient–doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex conversational data between an LLM patient and an LLM doctor, (2) runs an evaluation with a committee of two LLM judges across a variety of dimensions, and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges. Medical experts validated a subset of 211/3,390 questions (MedMistake-Bench), on which we ran a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, and Mistral Large. We found that the GPT, Claude, and Grok models obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench) and the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.
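The three-stage pipeline in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `run_pipeline`, the `Turn`/`QAPair` classes, and the callables passed in (`patient_llm`, `doctor_llm`, `judges`) are all hypothetical stand-ins; in practice each callable would wrap a real LLM API call, and the gold answer would be written by the adjudicator model.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str   # "patient" or "doctor"
    text: str

@dataclass
class QAPair:
    question: str
    answer: str
    error_dimension: str

def run_pipeline(patient_llm, doctor_llm, judges, n_turns=4):
    """Sketch of the pipeline: (1) role-play a dialogue, (2) have a judge
    committee flag mistakes, (3) distill each agreed mistake into a
    single-shot QA pair."""
    # Stage 1: alternate patient/doctor turns to build a conversation.
    dialogue = []
    for i in range(n_turns):
        role, llm = ("patient", patient_llm) if i % 2 == 0 else ("doctor", doctor_llm)
        dialogue.append(Turn(role, llm(dialogue)))

    # Stage 2: each judge returns (turn_index, error_dimension) flags;
    # a mistake counts only if every judge in the committee flags it.
    flags = [judge(dialogue) for judge in judges]
    agreed = set(flags[0]).intersection(*map(set, flags[1:]))

    # Stage 3: convert each agreed-upon mistake into a standalone QA item.
    return [
        QAPair(
            question=f"A doctor replied: {dialogue[idx].text!r}. What should the doctor have said?",
            answer="<gold answer distilled by the adjudicator>",
            error_dimension=dim,
        )
        for idx, dim in sorted(agreed)
    ]
```

With stub callables in place of real models, the function yields one QA pair per mistake that both judges agree on, mirroring how dialogue-level errors become single-turn benchmark items.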
Problem

Research questions and friction points this paper is trying to address.

Automatically replicates LLM mistakes in medical conversations
Creates benchmark QA pairs from LLM errors for evaluation
Evaluates frontier LLMs on validated medical mistake dataset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic pipeline extracts LLM mistakes from conversations
Converts mistakes into single-shot QA benchmark dataset
Uses committee of LLM judges for multi-dimensional evaluation
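The committee-based scoring described above can be sketched as follows. This is an illustrative guess at the evaluation loop, not the released code: `benchmark_accuracy`, the judge callables, and the item schema are hypothetical; the paper's two-judge committee corresponds to requiring unanimous approval.

```python
def benchmark_accuracy(model, qa_items, judges):
    """Fraction of single-shot QA items the model answers correctly,
    where 'correct' requires every judge in the committee to approve
    the answer against the gold reference."""
    correct = 0
    for item in qa_items:
        answer = model(item["question"])
        # Unanimity over the judge committee decides correctness.
        if all(judge(answer, item["answer"]) for judge in judges):
            correct += 1
    return correct / len(qa_items)
```

Scoring each of the 12 evaluated LLMs on MedMistake-Bench would then reduce to one call of this function per model.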