Superhuman performance of a large language model on the reasoning tasks of a physician

📅 2024-12-14
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This study evaluates OpenAI's o1-preview model, which runs chain-of-thought processing before generating a response, on clinical reasoning tasks across five experiments: differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning. Physician experts adjudicated the outputs using validated psychometrics, and the results were compared against identical prior experiments with historical human controls and benchmarks of earlier LLMs. o1-preview showed significant improvements in differential diagnosis generation and in the quality of diagnostic and management reasoning. No improvements were observed in probabilistic reasoning or triage differential diagnosis, where its performance was similar to that of past models. The authors call for new robust benchmarks, scalable evaluation of LLM capabilities against human physicians, and trials of AI in real clinical settings.

📝 Abstract
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. However, such benchmarks are highly constrained, saturated with repeated impressive performance by LLMs, and have an unclear relationship to performance in real clinical scenarios. Clinical reasoning, the process by which physicians employ critical thinking to gather and synthesize clinical data to diagnose and manage medical problems, remains an attractive benchmark for model performance. Prior LLMs have shown promise in outperforming clinicians in routine and complex diagnostic scenarios. We sought to evaluate OpenAI's o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response. We characterize the performance of o1-preview with five experiments including differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, adjudicated by physician experts with validated psychometrics. Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs. Significant improvements were observed with differential diagnosis generation and quality of diagnostic and management reasoning. No improvements were observed with probabilistic reasoning or triage differential diagnosis. This study highlights o1-preview's ability to perform strongly on tasks that require complex critical thinking such as diagnosis and management while its performance on probabilistic reasoning tasks was similar to past models. New robust benchmarks and scalable evaluation of LLM capabilities compared to human physicians are needed along with trials evaluating AI in real clinical settings.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance on clinical diagnostic reasoning tasks
Comparing AI and physician diagnostic accuracy in emergency settings
Assessing LLM superhuman abilities in medical decision-making and reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM outperforms physicians in clinical reasoning
Five experiments validate AI diagnostic accuracy
Real-world ER study confirms LLM superhuman performance
P. Brodeur
Department of Internal Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts
Thomas A. Buckley
Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
Zahir Kanjee
Department of Internal Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts
Ethan Goh
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California; Stanford Clinical Excellence Research Center, Stanford University, Stanford, California
Evelyn Bin Ling
Department of Internal Medicine, Stanford University School of Medicine, Stanford, California
Priyank Jain
Department of Internal Medicine, Cambridge Health Alliance, Cambridge, Massachusetts
Stephanie Cabral
Department of Internal Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts
Raja-Elie Abdulnour
Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, Massachusetts
Adrian Haimovich
Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts
Jason Freed
Department of Hematology-Oncology, Beth Israel Deaconess Medical Center, Boston, Massachusetts
Andrew P J Olson
Department of Hospital Medicine, University of Minnesota Medical School, Minneapolis
Daniel J. Morgan
Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland; Veterans Affairs Maryland Healthcare System, Baltimore, Maryland
Jason Hom
Department of Internal Medicine, Stanford University School of Medicine, Stanford, California
Robert J Gallo
Center for Innovation to Implementation, VA Palo Alto Health Care System, Palo Alto, California
Eric Horvitz
Microsoft
Machine intelligence, decision theory, decisions under uncertainty, information retrieval, bounded
Jonathan H. Chen
Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California; Stanford Clinical Excellence Research Center, Stanford University, Stanford, California; Department of Internal Medicine, Stanford University School of Medicine, Stanford, California
Arjun K. Manrai
Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
Adam Rodman
Assistant Professor of Medicine, Harvard Medical School
Clinical reasoning, AI, digital education, medical history