PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

📅 2026-05-04
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses a limitation of existing medical agent benchmarks: they do not simulate the long-horizon, multi-step, verifiable workflows of real-world clinical practice. The authors introduce PhysicianBench, a benchmark instantiated in an EHR environment with real patient records, comprising 100 tasks adapted from real consultation cases that span 21 specialties and diverse clinical workflows. Agents must combine data retrieval, clinical reasoning, system interaction, and documentation. Tasks are executed and verified through standard EHR APIs, and 670 structured checkpoints enable fine-grained assessment. Across 13 LLM agents, the best-performing model achieves only a 46% success rate (pass@1), and open-source models reach at most 19%, underscoring a substantial gap between current capabilities and real-world clinical demands.
📝 Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion, graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
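The abstract describes checkpoint-based grading only at a high level (670 checkpoints over 100 tasks, or 6.7 per task on average). The sketch below is one plausible shape for such grading, not the authors' implementation. The names `Checkpoint`, `grade_task`, and `pass_at_1`, the dictionary used for the final environment state, and the rule that a task succeeds only when every checkpoint passes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the environment state after the agent finishes.
State = Dict[str, object]


@dataclass
class Checkpoint:
    """One verifiable stage of a task (hypothetical structure)."""
    name: str
    check: Callable[[State], bool]  # inspects the final state, returns pass/fail


def grade_task(checkpoints: List[Checkpoint], final_state: State) -> dict:
    """Run every checkpoint against the final state and summarize the results.

    Assumption: a task counts as solved only if all of its checkpoints pass.
    """
    if not checkpoints:
        raise ValueError("a task needs at least one checkpoint")
    results = {cp.name: bool(cp.check(final_state)) for cp in checkpoints}
    passed = sum(results.values())
    return {
        "checkpoints": results,
        "checkpoint_rate": passed / len(checkpoints),
        "success": passed == len(checkpoints),
    }


def pass_at_1(task_successes: List[bool]) -> float:
    """Fraction of tasks solved when each task is attempted exactly once."""
    if not task_successes:
        raise ValueError("no tasks to score")
    return sum(task_successes) / len(task_successes)


# Toy example with made-up state keys: data was retrieved and an order was
# placed, but no note was written, so the task as a whole does not succeed.
demo_checkpoints = [
    Checkpoint("retrieved_labs", lambda s: s.get("labs_retrieved") is True),
    Checkpoint("placed_order", lambda s: "metformin" in s.get("orders", [])),
    Checkpoint("wrote_note", lambda s: len(str(s.get("note", ""))) > 0),
]
demo_state = {"labs_retrieved": True, "orders": ["metformin"], "note": ""}
demo_result = grade_task(demo_checkpoints, demo_state)
```

Checking the state that remains in the environment, rather than the agent's transcript, is what makes this style of grading execution-grounded: an agent that only states an intent to place an order would fail the `placed_order` check here.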
Problem

Research questions and friction points this paper is trying to address.

LLM agents
clinical workflows
electronic health records
benchmark evaluation
real-world execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

PhysicianBench
LLM agents
EHR environment
long-horizon tasks
execution-grounded evaluation
Ruoqi Liu
Stanford University
Imran Q. Mohiuddin
Stanford University
Austin J. Schoeffler
Stanford University
Kavita Renduchintala
Stanford University
Ashwin Nayak
University of Waterloo
Quantum Computation, Quantum Information, Theoretical Computer Science
Prasantha L. Vemu
Stanford University
Shivam C. Vedak
Stanford University
Kameron C. Black
Stanford University
John L. Havlik
Stanford University
Isaac Ogunmola
Stanford University
Stephen P. Ma
Stanford University
Roopa Dhatt
Stanford University
Jonathan H. Chen
Stanford University