Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions

📅 2024-02-28
🏛️ arXiv.org
📈 Citations: 14 (2 influential)
🤖 AI Summary
Current large language models (LLMs) lack rigorous, clinically grounded evaluation of reasoning interpretability in complex medical decision-making. Method: The authors introduce JAMA Clinical Challenge and Medbullets, two high-difficulty multiple-choice clinical benchmarks featuring authoritative, fine-grained expert explanations, and the first such resources built from realistic clinical scenarios. Evaluation employs zero-shot and few-shot prompting, automated explanation-quality scoring, dual-blinded clinical expert assessment (Cohen's κ = 0.82), and comparative analysis. Contribution/Results: Seven state-of-the-art LLMs perform substantially worse on these benchmarks than on conventional exam-style benchmarks. Their generated explanations frequently contain logical gaps and factual hallucinations, exposing deficiencies in clinical-grade causal reasoning and domain-specific knowledge integration, and highlighting a fundamental gap between current LLM capabilities and safe, interpretable clinical deployment.

📝 Abstract
LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets (datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA). JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.
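The abstract describes the benchmarks as multiple-choice QA tasks evaluated with various prompts. A minimal sketch of how such an item might be represented and scored is below; the field names, prompt wording, and stub model are illustrative assumptions, not the paper's actual schema or evaluation code.

```python
# Hedged sketch of multiple-choice clinical QA evaluation, in the style of
# benchmarks like JAMA Clinical Challenge or Medbullets. Field names and the
# stub "model" are assumptions for illustration only.

def build_zero_shot_prompt(item: dict) -> str:
    """Format a clinical vignette and its answer options as a zero-shot prompt."""
    options = "\n".join(f"{label}. {text}" for label, text in sorted(item["options"].items()))
    return (f"{item['question']}\n\n{options}\n\n"
            "Answer with the letter of the best option, then explain your reasoning.")

def extract_choice(response: str) -> str:
    """Take the first A-E letter appearing in the response as the predicted answer."""
    for ch in response:
        if ch in "ABCDE":
            return ch
    return ""

def accuracy(items: list, predict) -> float:
    """Score a callable `predict(prompt) -> response` against the gold answers."""
    correct = sum(extract_choice(predict(build_zero_shot_prompt(it))) == it["answer"]
                  for it in items)
    return correct / len(items)

# Toy item with an invented vignette (not drawn from the real datasets).
item = {
    "question": ("A 58-year-old presents with crushing chest pain radiating to the "
                 "left arm. Which is the most appropriate next step?"),
    "options": {"A": "Obtain an ECG", "B": "Discharge home", "C": "Order a skull X-ray"},
    "answer": "A",
}
print(accuracy([item], lambda prompt: "A. Obtain an ECG, because ..."))  # 1.0
```

In practice `predict` would wrap an LLM API call, and the expert-written explanations in the datasets would additionally be compared against the model's free-text reasoning, which is the part of the evaluation the paper highlights as most revealing.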
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Complex Medical Problems
Decision-making Assistance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Medical Problem Datasets
Language Model Evaluation
Complex Question Analysis
Hanjie Chen, Johns Hopkins University
Zhouxiang Fang, Johns Hopkins University
Yash Singla, Johns Hopkins University
Mark Dredze, Johns Hopkins University
Machine Learning · Natural Language Processing · Health Informatics · Computational Epidemiology