🤖 AI Summary
This study addresses a limitation of existing medical agent benchmarks: they do not simulate the long-horizon, multi-step, verifiable workflows of real-world clinical practice. To bridge this gap, the authors introduce a benchmark built on an electronic health record (EHR) environment containing real patient records. It comprises 100 tasks adapted from actual consultation cases, spanning 21 specialties and diverse clinical workflows. Agents must combine data retrieval, clinical reasoning, system interaction, and documentation. Tasks are executed and verified through standard EHR APIs, and a structured checkpoint mechanism enables fine-grained assessment. Among 13 leading large language model agents, the best performer achieves a pass@1 (single-attempt) success rate of only 46%, and open-source models reach at most 19%, underscoring a substantial gap between current capabilities and real-world clinical demands.
📝 Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves a pass@1 success rate of only 46%, while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic, execution-grounded benchmark for measuring progress toward autonomous clinical agents.
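To make the checkpoint-based grading and the pass@1 figure concrete, the following is a minimal sketch of how per-task checkpoints could be evaluated against the final environment state and aggregated. It is not PhysicianBench's actual grader: the `Checkpoint` type, the dictionary-shaped environment state, the function names, and the rule that a task succeeds only when all of its checkpoints pass are illustrative assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the EHR state after one agent run.
# The real benchmark verifies against an EHR environment via its APIs.
EnvState = Dict[str, object]


@dataclass
class Checkpoint:
    """One gradable stage of a task (name and check are illustrative)."""
    name: str
    check: Callable[[EnvState], bool]


def grade_task(checkpoints: List[Checkpoint], final_state: EnvState) -> Dict[str, bool]:
    """Run every checkpoint script against the final state of a single attempt."""
    return {cp.name: bool(cp.check(final_state)) for cp in checkpoints}


def task_passed(results: Dict[str, bool]) -> bool:
    """Assumption: a task counts as solved only if all of its checkpoints pass."""
    return all(results.values())


def pass_at_1(per_task_results: List[Dict[str, bool]]) -> float:
    """Fraction of tasks solved, given exactly one attempt per task."""
    if not per_task_results:
        return 0.0
    solved = sum(task_passed(r) for r in per_task_results)
    return solved / len(per_task_results)


def checkpoint_pass_rate(per_task_results: List[Dict[str, bool]]) -> float:
    """Finer-grained view: fraction of all checkpoints passed across tasks."""
    total = sum(len(r) for r in per_task_results)
    passed = sum(sum(r.values()) for r in per_task_results)
    return passed / total if total else 0.0


# Toy example with made-up checkpoints and states (not benchmark data).
checkpoints = [
    Checkpoint("labs_retrieved", lambda s: bool(s.get("labs_read"))),
    Checkpoint("order_placed", lambda s: len(s.get("orders", [])) > 0),
    Checkpoint("note_written", lambda s: bool(s.get("note"))),
]
run_a = {"labs_read": True, "orders": ["hypothetical-order"], "note": "Plan documented."}
run_b = {"labs_read": True, "orders": [], "note": "Plan documented."}

results = [grade_task(checkpoints, run_a), grade_task(checkpoints, run_b)]
print(pass_at_1(results), checkpoint_pass_rate(results))
```

Reporting both numbers reflects the design described in the abstract: pass@1 measures end-to-end task success, while the checkpoint-level rate shows how far an agent progressed on tasks it did not fully complete.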