🤖 AI Summary
Problem: Current large language models (LLMs) lack rigorous, clinically grounded evaluation of reasoning interpretability in complex medical decision-making. Method: We introduce JAMA Clinical Challenge and Medbullets, two high-difficulty multiple-choice clinical benchmarks featuring authoritative, fine-grained expert explanations and the first such resources designed around realistic clinical scenarios. Evaluation employs zero-shot and few-shot prompting, automated explanation-quality scoring, dual-blinded clinical expert assessment (Cohen’s κ = 0.82), and comparative analysis. Contribution/Results: Seven state-of-the-art LLMs perform substantially worse on these benchmarks than on conventional exam-style benchmarks. Their generated explanations frequently contain logical gaps and factual hallucinations, exposing critical deficiencies in clinical-grade causal reasoning and domain-specific knowledge integration, and highlighting a fundamental gap between current LLM capabilities and safe, interpretable clinical deployment.
📝 Abstract
LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exams or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. (Datasets and code are available at https://github.com/HanjieChen/ChallengeClinicalQA.) JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. In-depth automatic and human evaluations of model-generated explanations provide insights into the promise and deficiencies of LLMs for explainable medical QA.