🤖 AI Summary
Existing medical LLM evaluations rely predominantly on standardized examinations and fail to capture real-world clinical complexity. To address this, we propose MedHELM, the first comprehensive evaluation framework grounded in clinical consensus. Methodologically, it establishes a fine-grained taxonomy covering 5 categories, 22 subcategories, and 121 authentic clinical tasks; integrates 35 benchmarks (17 existing and 18 newly formulated); introduces LLM-Jury, a collaborative scoring protocol whose agreement with clinician ratings (ICC = 0.47) exceeds average clinician-clinician agreement (ICC = 0.43); and pairs normalized accuracy (0-1) with estimated computational cost. Together, these enable standardized, cost-aware, and reproducible assessment of medical LLMs. Empirical validation across nine state-of-the-art models shows DeepSeek-R1 (66% win rate) and o3-mini (64%) as top performers; notably, Claude 3.5 Sonnet achieves comparable performance at roughly 60% of their estimated computational cost.
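The win rates quoted above are head-to-head aggregates over the benchmark suite. The exact aggregation rule is defined in the paper; as a rough illustration, the sketch below computes a macro-averaged win rate from per-benchmark normalized accuracies, where the model names, scores, and tie handling are illustrative assumptions rather than MedHELM's actual data or rules.

```python
from itertools import combinations

# Hypothetical per-benchmark normalized accuracies (0-1) for a few models;
# the real MedHELM evaluation covers 9 models and 35 benchmarks.
scores = {
    "deepseek-r1":       {"bench_a": 0.82, "bench_b": 0.61, "bench_c": 0.74},
    "o3-mini":           {"bench_a": 0.80, "bench_b": 0.63, "bench_c": 0.70},
    "claude-3.5-sonnet": {"bench_a": 0.79, "bench_b": 0.60, "bench_c": 0.72},
}

def win_rates(scores: dict) -> dict:
    """Macro-averaged head-to-head win rate: for every benchmark and every
    opponent, a model earns 1 for a higher score, 0.5 for a tie, 0 otherwise."""
    models = list(scores)
    benchmarks = next(iter(scores.values())).keys()
    wins = {m: [] for m in models}
    for bench in benchmarks:
        for a, b in combinations(models, 2):
            sa, sb = scores[a][bench], scores[b][bench]
            wins[a].append(1.0 if sa > sb else 0.5 if sa == sb else 0.0)
            wins[b].append(1.0 if sb > sa else 0.5 if sa == sb else 0.0)
    return {m: sum(w) / len(w) for m, w in wins.items()}

print(win_rates(scores))
```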
📝 Abstract
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs using an improved evaluation method (an LLM-jury) together with a cost-performance analysis. Evaluating 9 frontier LLMs on the 35 benchmarks revealed substantial performance variation. Advanced reasoning models (DeepSeek R1: 66% win rate; o3-mini: 64% win rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines, including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical uses of LLMs, and we provide an open-source framework to enable it.
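The ICC figures report agreement between LLM-jury scores and clinician ratings. The abstract does not state which ICC variant is used, so the sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater) as one plausible formulation; the toy ratings matrix is illustrative only.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_subjects, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                   # between-subjects mean square
    msc = ss_cols / (k - 1)                   # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))      # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 6 model responses scored 1-5 by an LLM jury (column 0)
# and by a clinician (column 1); values are made up for illustration.
ratings = np.array([[4, 4], [3, 2], [5, 5], [2, 3], [4, 5], [1, 1]], dtype=float)
print(round(icc_2_1(ratings), 2))
```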