🤖 AI Summary
This work addresses the challenge of systematically evaluating medical errors in LLM-generated doctor–patient dialogues. We propose MedMistake, an automated framework that synthesizes realistic clinical dialogues via multi-LLM role-playing (patient, physician, and adjudicator), and introduces a novel multi-adjudicator committee mechanism for automated error identification and QA-pair distillation—transforming complex dialogue-level errors into standardized single-turn question–answer samples. We release MedMistake-Bench, the first clinically validated medical error benchmark (211 items), and MedMistake-All, a large-scale dataset (3,390 items), both covering high-difficulty, high-clinical-relevance errors. Comprehensive evaluation across 12 state-of-the-art LLMs demonstrates MedMistake’s capability to consistently reproduce and assess model-specific medical errors. Our framework establishes a scalable, empirically verifiable paradigm for safety evaluation of medical LLMs.
📝 Abstract
Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics that quantify reasoning quality, safety, and patient-centeredness. Yet replicating specific mistakes in other LLMs is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient–doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) generates complex conversational data between an LLM patient and an LLM doctor, (2) evaluates the dialogue with a committee of 2 LLM judges across a variety of dimensions, and (3) converts the identified mistakes into simplified single-shot QA scenarios. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs that GPT-5 and Gemini 2.5 Pro currently fail to answer correctly, as judged by two LLM judges. Medical experts validated a subset of 211/3,390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, and Mistral Large. We found that the GPT, Claude, and Grok models obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench) and the full dataset (MedMistake-All) at https://huggingface.co/datasets/TheLumos/MedicalMistakeBenchmark.
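The three pipeline stages above can be sketched in code. This is a minimal, illustrative sketch only: the `chat` stub, the prompts, and the unanimity rule for the judge committee are assumptions standing in for the paper's actual prompts and adjudication logic.

```python
# Illustrative sketch of the MedMistake pipeline's three stages.
# `chat(system_prompt, message)` is a hypothetical stand-in for a real
# LLM API call; prompts and committee logic here are assumptions.

def chat(system_prompt, message):
    # Stub for an LLM call (assumption; replace with a real client).
    return f"response({system_prompt[:20]!r}, {message[:20]!r})"

def simulate_dialogue(case, turns=3):
    """Stage 1: an LLM patient and LLM doctor role-play a consultation."""
    history = []
    patient_msg = chat(f"You are a patient with: {case}", "Describe your symptoms.")
    for _ in range(turns):
        doctor_msg = chat("You are a physician.", patient_msg)
        history.append(("patient", patient_msg))
        history.append(("doctor", doctor_msg))
        patient_msg = chat(f"You are a patient with: {case}", doctor_msg)
    return history

def committee_review(dialogue, judges=("judge_a", "judge_b")):
    """Stage 2: each judge flags doctor turns; keep unanimously flagged ones."""
    flagged = []
    for turn_idx, (role, text) in enumerate(dialogue):
        if role != "doctor":
            continue
        votes = [chat(f"You are {judge}. Flag medical errors.", text)
                 for judge in judges]
        if all(votes):  # unanimity criterion (illustrative choice)
            flagged.append((turn_idx, text))
    return flagged

def distill_qa(flagged_turns):
    """Stage 3: convert each flagged mistake into a single-shot QA pair."""
    return [{"question": chat("Rewrite this mistake as a standalone question.", text),
             "answer": chat("Give the clinically correct answer.", text)}
            for _, text in flagged_turns]

dialogue = simulate_dialogue("suspected myocardial infarction")
qa_pairs = distill_qa(committee_review(dialogue))
```

With the stub above, a 3-turn dialogue yields three doctor turns, each of which the (trivially agreeing) committee flags and distills into one QA pair; in practice each stage would call a different underlying model.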