HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the absence of benchmarks for evaluating large language model (LLM)-driven computer-use agents (CUAs) on end-to-end healthcare administrative workflows. The authors introduce HealthAdminBench, the first standardized evaluation environment for such workflows, integrating an electronic health record (EHR) system, two payer portals, and a fax system. The benchmark comprises 135 expert-defined tasks across three task types (Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment Order Processing), decomposed into 1,698 verifiable subtasks. The study evaluates leading CUAs end to end in this simulated environment under several prompting strategies and interface observation configurations. Even the best-performing agent, Claude Opus 4.6 CUA, achieves only a 36.3% task success rate, while GPT-5.4 CUA attains the highest subtask success rate at 82.8%. The results point to a significant gap between current agent capabilities and real-world deployment readiness, and the benchmark extends LLM evaluation beyond clinical applications.

📝 Abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments (an EHR, two payer portals, and a fax system) and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3% task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8%). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
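
The gap between subtask-level and task-level scores can be illustrated with a small sketch. It assumes that a task counts as successful only when every one of its subtasks passes; the abstract does not spell out the scoring rule, so this is an assumption rather than the benchmark's actual implementation. The `TaskResult` type, the task IDs, and the pass/fail values below are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task run: one pass/fail flag per subtask."""
    task_id: str
    subtask_passed: list[bool]


def task_success_rate(results: list[TaskResult]) -> float:
    # Assumption: a task succeeds only if all of its subtasks pass.
    return sum(all(r.subtask_passed) for r in results) / len(results)


def subtask_success_rate(results: list[TaskResult]) -> float:
    # Fraction of individual subtask checks that passed, pooled over all tasks.
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total


# Made-up runs: most subtasks pass, but a single miss fails the whole task.
results = [
    TaskResult("prior_auth_01", [True] * 10),
    TaskResult("appeal_01", [True] * 9 + [False]),
    TaskResult("dme_01", [True] * 8 + [False] * 2),
]

print(f"task success:    {task_success_rate(results):.1%}")
print(f"subtask success: {subtask_success_rate(results):.1%}")
```

In this toy data 27 of 30 subtasks pass (90%), yet only one of three tasks completes end to end. Under the all-subtasks-must-pass assumption, the same effect would explain how high subtask scores can coexist with low task success on long, multi-step workflows.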

Problem

Research questions and friction points this paper is trying to address.

healthcare administration

computer-use agents

benchmark

LLM

administrative workflows

Innovation

Methods, ideas, or system contributions that make the work stand out.

computer-use agents

healthcare administration

benchmarking