HealthBench: Evaluating Large Language Models Towards Improved Human Health

📅 2025-05-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the lack of realistic, open-ended, multi-turn evaluation frameworks for large language models (LLMs) in healthcare. The authors introduce HealthBench, an open-source benchmark comprising 5,000 multi-turn conversations between a model and an individual user or healthcare professional, evaluated against 48,562 fine-grained rubric criteria written by 262 physicians across health contexts including emergencies, clinical data transformation, and global health. They propose a physician-authored, conversation-specific rubric evaluation paradigm and release two variants: HealthBench Consensus (34 consensus-validated dimensions of model behavior) and HealthBench Hard (current top score: 32%), overcoming limitations of conventional single-turn multiple-choice or short-answer evaluations. Empirical results show steady initial progress (GPT-3.5 Turbo scores 16%, GPT-4o 32%) followed by rapid recent gains (o3 reaches 60%); notably, GPT-4.1 nano outperforms GPT-4o at 1/25 the cost, validating a lightweight, cost-efficient pathway for medical LLMs.

📝 Abstract
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
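The rubric-based evaluation described above can be sketched as follows. This is a minimal illustration, not the released grading code: the criterion texts and point values are invented, and it assumes each conversation's rubric assigns point values to criteria (positive for desired behaviors, negative for harmful ones), with the score being earned points over the maximum achievable positive points, clipped to [0, 1].

```python
def rubric_score(criteria):
    """Score one conversation's response against its rubric.

    criteria: list of (points, met) pairs, where `points` is the
    criterion's weight and `met` is whether a grader judged the
    response to satisfy it.
    """
    earned = sum(points for points, met in criteria if met)
    max_positive = sum(points for points, _ in criteria if points > 0)
    if max_positive == 0:
        return 0.0
    # Clip to [0, 1] so negative (penalty) criteria cannot push
    # the score below zero.
    return min(max(earned / max_positive, 0.0), 1.0)

# Hypothetical rubric for an emergency-context conversation:
example = [
    (5, True),    # advises seeking emergency care given red-flag symptoms
    (3, True),    # asks a clarifying follow-up question
    (2, False),   # gives a concrete, actionable next step
    (-4, False),  # penalty: recommends an unsafe home remedy
]
print(rubric_score(example))  # earned 8 of 10 positive points -> 0.8
```

In the benchmark itself, the "met" judgments come from a model-based grader applied to each of the 48,562 physician-written criteria; this sketch only shows how per-criterion judgments aggregate into a single score.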
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM performance and safety in healthcare
Measuring model responses via physician-created rubrics
Assessing improvements in smaller, cost-effective models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source benchmark for healthcare LLM evaluation
Multi-turn conversations with physician-created rubrics
Includes specialized variations for consensus and difficulty
Authors
Rahul K. Arora
Jason Wei
Rebecca Soskin Hicks
Preston Bowman
Joaquin Quinonero-Candela
Foivos Tsimpourlas
Michael Sharman
Meghan Shah
Andrea Vallone
Alex Beutel (OpenAI; Data Mining, Machine Learning)
Johannes Heidecke
Karan Singhal (Health AI at OpenAI; Health AI, AI Safety, Beneficial AI)