🤖 AI Summary
Existing medical LLM evaluations rely predominantly on standardized examinations and fail to capture real-world clinical complexity. To address this, we propose MedHELM, the first comprehensive evaluation framework grounded in clinical consensus. Methodologically, it establishes a fine-grained taxonomy covering 5 categories, 22 subcategories, and 121 authentic clinical tasks; integrates 35 benchmarks (17 existing and 18 newly formulated); introduces LLM-Jury, a collaborative scoring protocol whose agreement with clinician ratings (ICC = 0.47) exceeds average clinician-clinician agreement (ICC = 0.43); and pairs normalized accuracy (0-1) with estimated computational cost. Together, these enable standardized, cost-aware, and reproducible assessment of medical LLMs. Empirical validation across nine state-of-the-art models shows DeepSeek-R1 (66% win rate) and o3-mini (64%) as top performers; notably, Claude 3.5 Sonnet achieves comparable performance at roughly 60% of their estimated computational cost.
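The win rates quoted above are head-to-head aggregates over the benchmark suite. The exact aggregation rule is defined in the paper; as a rough illustration, the sketch below computes a macro-averaged win rate from per-benchmark normalized accuracies, where the model names, scores, and tie handling are illustrative assumptions rather than MedHELM's actual data or rules.

```python
from itertools import combinations

# Hypothetical per-benchmark normalized accuracies (0-1) for a few models;
# the real MedHELM evaluation covers 9 models and 35 benchmarks.
scores = {
    "deepseek-r1":       {"bench_a": 0.82, "bench_b": 0.61, "bench_c": 0.74},
    "o3-mini":           {"bench_a": 0.80, "bench_b": 0.63, "bench_c": 0.70},
    "claude-3.5-sonnet": {"bench_a": 0.79, "bench_b": 0.60, "bench_c": 0.72},
}

def win_rates(scores: dict) -> dict:
    """Macro-averaged head-to-head win rate: for every benchmark and every
    opponent, a model earns 1 for a higher score, 0.5 for a tie, 0 otherwise."""
    models = list(scores)
    benchmarks = next(iter(scores.values())).keys()
    wins = {m: [] for m in models}
    for bench in benchmarks:
        for a, b in combinations(models, 2):
            sa, sb = scores[a][bench], scores[b][bench]
            wins[a].append(1.0 if sa > sb else 0.5 if sa == sb else 0.0)
            wins[b].append(1.0 if sb > sa else 0.5 if sa == sb else 0.0)
    return {m: sum(w) / len(w) for m, w in wins.items()}

print(win_rates(scores))
```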
📝 Abstract
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs using an improved evaluation method (an LLM-jury) together with a cost-performance analysis. Evaluating 9 frontier LLMs on the 35 benchmarks revealed substantial performance variation. Advanced reasoning models (DeepSeek R1: 66% win rate; o3-mini: 64% win rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines, including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical uses of LLMs, and we provide an open-source framework to enable it.
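The ICC figures report agreement between LLM-jury scores and clinician ratings. The abstract does not state which ICC variant is used, so the sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater) as one plausible formulation; the toy ratings matrix is illustrative only.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is an (n_subjects, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-subject means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                   # between-subjects mean square
    msc = ss_cols / (k - 1)                   # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))      # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 6 model responses scored 1-5 by an LLM jury (column 0)
# and by a clinician (column 1); values are made up for illustration.
ratings = np.array([[4, 4], [3, 2], [5, 5], [2, 3], [4, 5], [1, 1]], dtype=float)
print(round(icc_2_1(ratings), 2))
```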