HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the limited evaluation of large language models (LLMs) on the tasks clinicians most often bring to them. The authors introduce HealthBench Professional, an open benchmark organized around three common clinical use cases: care consult, writing and documentation, and medical research. Each example contains a physician-authored conversation with ChatGPT for Clinicians and is scored with rubrics written and iteratively adjudicated by three or more physicians across three phases. Examples were selected from a candidate pool of 15,079 for quality, representativeness, and difficulty; examples that are difficult for recent OpenAI models were enriched roughly 3.5-fold, and about one-third involve deliberate adversarial testing by physicians. Human physician responses (specialist-matched, with unbounded time and web access) were collected for all tasks as a baseline. The best-scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and the human physician baseline.

📝 Abstract

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.
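The abstract states that each example is scored via physician-written rubrics, but this page does not give the scoring formula. As a rough illustration only, the sketch below assumes a points-based rubric in which each criterion carries positive or negative points and the score is the points earned divided by the maximum achievable positive points, clipped to [0, 1]. The `Criterion` class, the `rubric_score` function, and the example criteria are all hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One hypothetical rubric item; not the benchmark's actual schema."""
    description: str
    points: int   # positive = desirable behavior, negative = undesirable
    met: bool     # grader's judgment of whether the response shows it


def rubric_score(criteria):
    """Assumed scheme: earned points / max positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-point criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up criteria for a single response (illustrative values only).
example = [
    Criterion("States the recommended next diagnostic step", 5, True),
    Criterion("Notes relevant contraindications", 3, False),
    Criterion("Recommends an unsafe dose", -4, False),
]
score = rubric_score(example)  # 5 earned out of 8 possible
```

Under this assumed scheme, a response that meets only negative-point criteria is clipped to 0 rather than receiving a negative score.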

Problem

Research questions and friction points this paper is trying to address.

large language models

clinical evaluation

real-world tasks

healthcare AI

clinician-chat interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

HealthBench Professional

clinical evaluation benchmark

real-world clinician chats