AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the absence of a standardized benchmark for evaluating language models' ability to assess clinical acuity (the urgency of a patient's condition) and guide users toward appropriate care, particularly under uncertainty. The authors introduce AcuityBench, a unified evaluation framework for acuity identification that harmonizes five public datasets under a shared four-level acuity framework and supports both four-way QA classification and free-form conversational tasks. The benchmark includes 217 physician-confirmed ambiguous cases, and evaluation combines a rubric-based judge for conversational responses with distributional comparisons between model predictions and physician judgments. Experiments across twelve frontier proprietary and open-weight models show that conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases, and that no model closely matches the distribution of physician judgments in ambiguous cases.
📝 Abstract
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
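The error-direction and distribution-matching ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes the four acuity levels are encoded as integers 0 to 3 (0 = home monitoring, 3 = immediate emergency care), and it uses total variation distance as one possible way to compare a model's predictions with physician judgments; the abstract does not state which metric the paper uses. All function names are hypothetical.

```python
from collections import Counter

# Assumed encoding: 0 = home monitoring ... 3 = immediate emergency care.
LEVELS = (0, 1, 2, 3)

def triage_error_rates(gold, pred):
    """Return accuracy plus over- and under-triage rates for ordinal labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    over = sum(p > g for g, p in zip(gold, pred))   # predicted more urgent than gold
    under = sum(p < g for g, p in zip(gold, pred))  # predicted less urgent than gold
    return {
        "accuracy": (n - over - under) / n,
        "over_triage": over / n,
        "under_triage": under / n,
    }

def label_distribution(labels):
    """Empirical distribution over the four acuity levels."""
    counts = Counter(labels)
    return [counts[level] / len(labels) for level in LEVELS]

def total_variation(p, q):
    """Total variation distance between two distributions over LEVELS."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Clear cases: one over-triage and one under-triage out of four.
rates = triage_error_rates(gold=[0, 1, 2, 3], pred=[0, 2, 1, 3])

# Ambiguous case: physicians split between levels 1 and 2,
# while repeated model samples all land on level 2.
physicians = label_distribution([1, 1, 2, 2])
model = label_distribution([2, 2, 2, 2])
gap = total_variation(physicians, model)
```

In this toy example the model's predictions are more concentrated than the physician votes, which is the pattern the abstract reports for ambiguous cases; a distance of 0 would mean the model reproduces the physicians' spread exactly.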
Problem

Research questions and friction points this paper is trying to address.

clinical acuity
language models
health benchmark
triage
uncertainty alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

acuity identification
clinical uncertainty
healthcare benchmark
triage evaluation
language models
Robin Linzmayer
Department of Computer Science, Columbia University, New York, NY, USA; Department of Biomedical Informatics, Columbia University, New York, NY, USA
Georgianna Lin
Department of Biomedical Informatics, Columbia University, New York, NY, USA
Di Coneybeare
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Jason Chu
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Trudi Cloyd
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Manish Garg
Group Leader (W2), Max Planck Institute for Solid State Research
Attosecond Science, Nanoscale Science
Miles Gordon
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Elizabeth Hartofilis
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Benjamin Hong
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Ashraf Hussain
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Eugene Y. Kim
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Oluchi Iheagwara King
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Ross McCormack
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Erica Olsen
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
John K. Riggins Jr
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Mustafa N. Rasheed
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Dana L. Sacco
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Vinay Saggar
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Osman R. Sayan
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Amit Shembekar
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Janice Shin-Kim
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Wendy W. Sun
Department of Emergency Medicine, Columbia University Irving Medical Center, New York, NY, USA
Bernard P. Chang
Associate Dean, Tushar Shah and Sara Zion Associate Professor of Emergency Medicine
Emergency Medicine, Stroke, PTSD, suicide, Clinician Well-Being
David Kessler
Columbia University Vagelos College of Physicians and Surgeons
Simulation, Ultrasound, Technology, Innovation, Pediatric Emergency Medicine
Noémie Elhadad
Associate Professor and Chair of Biomedical Informatics, Columbia University
machine learning for healthcare, health informatics, natural language processing, biomedical informatics, women's health