MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

📅 2025-05-26
🤖 AI Summary
Existing evaluations of medical LLMs rely mostly on standardized examinations and fail to capture the complexity of real-world clinical practice. MedHELM addresses this with a comprehensive evaluation framework built on clinician consensus. It establishes a fine-grained taxonomy covering 5 categories, 22 subcategories, and 121 clinical tasks; integrates 35 benchmarks, 18 of them newly formulated; introduces an LLM-jury scoring protocol whose agreement with clinician ratings (ICC = 0.47) exceeds average clinician-clinician agreement (ICC = 0.43); and pairs accuracy normalized to a 0-1 scale with a cost-performance analysis. Together these enable standardized, cost-aware, and reproducible assessment of medical LLMs. Across nine frontier models, DeepSeek R1 (66% win rate) and o3-mini (64%) perform best, while Claude 3.5 Sonnet achieves comparable results at roughly 60% of the estimated computational cost.

📝 Abstract
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs on the 35 benchmarks revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable it.
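The win rates quoted above summarize head-to-head comparisons between models. As a minimal sketch, assuming the HELM-style convention in which a model's win rate on a benchmark is the fraction of other models it outscores, averaged over benchmarks, the calculation could look like the following. The function name `mean_win_rate`, the half-credit rule for ties, and the example scores are illustrative assumptions, not code or values from the paper.

```python
def mean_win_rate(scores):
    """Average, over benchmarks, of the fraction of other models each model beats.

    scores: {benchmark_name: {model_name: score}}, where higher is better.
    Ties earn half a win (an assumption made for this sketch).
    """
    per_model_rates = {}
    for per_model in scores.values():
        for model, score in per_model.items():
            others = [s for name, s in per_model.items() if name != model]
            if not others:
                continue  # a benchmark with a single model yields no comparisons
            wins = sum(1.0 if score > s else 0.5 if score == s else 0.0
                       for s in others)
            per_model_rates.setdefault(model, []).append(wins / len(others))
    return {m: sum(r) / len(r) for m, r in per_model_rates.items()}


# Hypothetical normalized accuracies for three made-up models.
scores = {
    "bench_a": {"model_x": 0.80, "model_y": 0.70, "model_z": 0.60},
    "bench_b": {"model_x": 0.50, "model_y": 0.90, "model_z": 0.40},
}
result = mean_win_rate(scores)
print(result)
```

In this toy input, `model_x` and `model_y` each win all comparisons on one benchmark and half on the other, so both end at 0.75, while `model_z` never wins and ends at 0.0.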
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM performance on real-world medical tasks rather than licensing-exam questions
Introduces a clinician-validated taxonomy for comprehensive medical assessment
Compares cost and performance across LLMs and task categories, including clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clinician-validated taxonomy for medical tasks
Comprehensive benchmark suite with 35 benchmarks
LLM-jury evaluation method that agrees with clinician ratings better than ROUGE-L or BERTScore-F1
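The agreement figures in the abstract are intraclass correlation coefficients (ICC). The text here does not say which ICC variant was used, so the sketch below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) as defined by Shrout and Fleiss. The function name `icc_2_1` is illustrative, and the ratings matrix is the classic six-subject, four-rater example from that reference, not MedHELM data.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete n-subjects x k-raters matrix (list of rows)."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Rows are rated items, columns are raters.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(ratings), 2))  # 0.29
```

To compare a jury against clinicians in this style, one column would hold the jury score and the others the clinician scores for the same outputs; other ICC variants would give different numbers on the same data.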
Authors

Suhana Bedi
PhD Student, Stanford University
Generative AI in healthcare, Multimodal data fusion, Data Commons

Hejie Cui
Stanford University
Large Language Models, Multimodal Learning, Data Mining, Machine Learning, AI for Health

Miguel Fuentes
Stanford University School of Medicine, Stanford, CA, USA.

Alyssa Unell
Stanford University

Michael Wornow
Stanford University
machine learning, healthcare

Juan M. Banda
Stanford Health Care
Generative AI, Biomedical Informatics, Big Data Mining, Large-Scale Retrieval, Machine Learning

Nikesh Kotecha
Stanford Health Care, Palo Alto, CA, USA.

Timothy Keyes
Stanford Health Care, Palo Alto, CA, USA.

Yifan Mai
Research Engineer, Stanford CRFM
Machine Learning

Mert Oez
Microsoft Corporation, Redmond, WA, USA.

Hao Qiu
Microsoft Corporation, Redmond, WA, USA.

Shrey Jain
Microsoft Corporation, Redmond, WA, USA.

Leonardo Schettini
Microsoft Corporation, Redmond, WA, USA.

Mehr Kashyap
Stanford University School of Medicine, Stanford, CA, USA.

Jason Alan Fries
Stanford University
Healthcare, Foundation Models, Language Models, Data-Centric AI, NLP

Akshay Swaminathan
Stanford University School of Medicine, Stanford, CA, USA.

Philip Chung
Stanford University School of Medicine, Stanford, CA, USA.

Fateme Nateghi
Stanford University School of Medicine, Stanford, CA, USA.

Asad Aali
Stanford University School of Medicine, Stanford, CA, USA.

Ashwin Nayak
University of Waterloo
Quantum Computation, Quantum Information, Theoretical Computer Science

Shivam Vedak
Stanford University School of Medicine, Stanford, CA, USA.

Sneha S. Jain
Stanford University School of Medicine, Stanford, CA, USA.

Birju Patel
Stanford University School of Medicine, Stanford, CA, USA.

Oluseyi Fayanju
Stanford University School of Medicine, Stanford, CA, USA.

Shreya Shah
Stanford University School of Medicine, Stanford, CA, USA.

Ethan Goh
Stanford University School of Medicine, Stanford, CA, USA.

Dong-han Yao
Stanford University School of Medicine, Stanford, CA, USA.

Brian Soetikno
Stanford University School of Medicine, Stanford, CA, USA.

Eduardo Reis
Stanford University School of Medicine, Stanford, CA, USA.

Sergios Gatidis
Stanford Medicine
Healthcare AI, Medical Image and Data Analysis, Pediatric Radiology, Hybrid Imaging

Vasu Divi
Stanford University School of Medicine, Stanford, CA, USA.

Robson Capasso
Stanford University School of Medicine, Stanford, CA, USA.

Rachna Saralkar
Stanford University School of Medicine, Stanford, CA, USA.

Chia-Chun Chiang
Stanford University School of Medicine, Stanford, CA, USA.

Jenelle Jindal
Stanford University

Tho Pham
Stanford University School of Medicine, Stanford, CA, USA.

Faraz Ghoddusi
Stanford University School of Medicine, Stanford, CA, USA.

Steven Lin
Stanford University School of Medicine, Stanford, CA, USA.

Albert S. Chiou
Stanford University School of Medicine, Stanford, CA, USA.

Christy Hong
Stanford University School of Medicine, Stanford, CA, USA.

Mohana Roy
Stanford University School of Medicine, Stanford, CA, USA.

Michael F. Gensheimer
Stanford University School of Medicine, Stanford, CA, USA.

Hinesh Patel
Stanford University School of Medicine, Stanford, CA, USA.

Kevin Schulman
Stanford University School of Medicine, Stanford, CA, USA.

Dev Dash
Stanford University School of Medicine, Stanford, CA, USA.

Danton Char
Stanford University School of Medicine, Stanford, CA, USA.

Lance Downing
Stanford University School of Medicine, Stanford, CA, USA.

Francois Grolleau
Stanford University School of Medicine, Stanford, CA, USA.

Kameron Black
Stanford University School of Medicine, Stanford, CA, USA.

Bethel Mieso
Stanford University School of Medicine, Stanford, CA, USA.

Aydin Zahedivash
Stanford University School of Medicine, Stanford, CA, USA.

Wen-wai Yim
Microsoft Corporation, Redmond, WA, USA.

Harshita Sharma
Senior Researcher at Microsoft
Computer vision, Medical image analysis, Machine learning, Biomedical imaging, Multimodal methods

Tony Lee
Center for Research on Foundation Models (CRFM) & Department of Computer Science, Stanford University, CA, USA.

Hannah Kirsch
Stanford Health Care, Palo Alto, CA, USA.

Jennifer Lee
Stanford Health Care, Palo Alto, CA, USA.

Nerissa Ambers
Stanford Health Care, Palo Alto, CA, USA.

Carlene Lugtu
Stanford Health Care, Palo Alto, CA, USA.

Aditya Sharma
Stanford Health Care, Palo Alto, CA, USA.

Bilal Mawji
Stanford Health Care, Palo Alto, CA, USA.

Alex Alekseyev
Stanford Health Care, Palo Alto, CA, USA.

Vicky Zhou
Stanford Health Care, Palo Alto, CA, USA.

Vikas Kakkar
Stanford Health Care, Palo Alto, CA, USA.

Jarrod Helzer
Stanford Health Care, Palo Alto, CA, USA.

Anurang Revri
Stanford Health Care, Palo Alto, CA, USA.

Yair Bannett
Stanford University School of Medicine, Stanford, CA, USA.

Roxana Daneshjou
Stanford University School of Medicine, Stanford, CA, USA.

Jonathan Chen
Stanford University School of Medicine, Stanford, CA, USA.

Emily Alsentzer
Assistant Professor, Stanford University
machine learning for healthcare

Keith Morse
Clinical Assistant Professor, Stanford University
Clinical informatics

Nirmal Ravi
Stanford University School of Medicine, Stanford, CA, USA.

Nima Aghaeepour
Stanford University
Machine Learning, Artificial Intelligence, Systems Immunology, Data Integration, Wearable Devices

Vanessa Kennedy
Stanford University School of Medicine, Stanford, CA, USA.

Akshay Chaudhari
Stanford University School of Medicine, Stanford, CA, USA.

Thomas Wang
Center for Research on Foundation Models (CRFM) & Department of Computer Science, Stanford University, CA, USA.

Sanmi Koyejo
Assistant Professor, Stanford University
Machine Learning, Healthcare AI, Neuroinformatics

Matthew P. Lungren
Stanford University School of Medicine, Stanford, CA, USA.

Eric Horvitz
Microsoft
Machine intelligence, decision theory, decisions under uncertainty, information retrieval, bounded

Percy Liang
Associate Professor of Computer Science, Stanford University
machine learning, natural language processing

Mike Pfeffer
Stanford Health Care, Palo Alto, CA, USA.

Nigam H. Shah
Stanford University School of Medicine, Stanford, CA, USA.