🤖 AI Summary
This work addresses the lack of a unified benchmark for evaluating whether language models can assess clinical acuity (the urgency of care a patient's presentation calls for) and guide users toward the appropriate level of care, particularly under uncertainty. It introduces AcuityBench, which harmonizes five public datasets under a shared four-level acuity framework and supports both four-way classification in a QA setting and free-form conversational responses. The benchmark includes a subset of physician-confirmed ambiguous cases; conversational responses are scored by a rubric-based judge anchored to the same framework, and model outputs on ambiguous cases are compared against the distribution of physician judgments. Experiments across 12 frontier proprietary and open-weight models show that conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases, and that no model closely matches the distribution of physician judgments on ambiguous cases, with model predictions more concentrated than expert clinical uncertainty.
📝 Abstract
We introduce AcuityBench, a benchmark for evaluating whether language models identify the appropriate urgency of care from user medical presentations. Existing health benchmarks emphasize medical question answering, broad health interactions, or narrow workflow-specific triage tasks, but they do not offer a unified evaluation of acuity identification across these settings. AcuityBench addresses this gap by harmonizing five public datasets spanning user conversations, online forum posts, clinical vignettes, and patient portal messages under a shared four-level acuity framework ranging from home monitoring to immediate emergency care. The benchmark contains 914 cases, including 697 consensus cases for standard accuracy evaluation and 217 physician-confirmed ambiguous cases for uncertainty-aware evaluation. It supports two complementary task formats: explicit four-way classification in a QA setting, and free-form conversational responses evaluated with a rubric-based judge anchored to the same framework. Across 12 frontier proprietary and open-weight models, we find substantial variation in clear-case acuity accuracy and error direction. Comparing task formats reveals a systematic tradeoff: conversational responses reduce over-triage but increase under-triage relative to QA, especially in higher-acuity cases. In ambiguous cases, no model closely matches the distribution of physician judgments, and model predictions are more concentrated than expert clinical uncertainty. We also compare expert and model adjudication on a subset of maximally ambiguous cases, using those cases to examine the role of clinical uncertainty in label disagreement. Together, these results position acuity identification as a distinct safety-critical capability and show that AcuityBench enables systematic comparison and stress-testing of how well models guide users to the right level of care in real-world health use.
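The quantities the abstract refers to (accuracy and error direction on consensus cases, and how closely model predictions match the spread of physician judgments on ambiguous cases) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration and not the benchmark's actual code: the integer encoding of the four levels (0 for home monitoring up to 3 for immediate emergency care), the function names, and the choice of total variation distance and Shannon entropy as the distributional measures.

```python
from collections import Counter
from math import log2

# Assumed encoding: 0 = home monitoring ... 3 = immediate emergency care.
N_LEVELS = 4


def triage_error_rates(gold, pred):
    """Accuracy, over-triage rate, and under-triage rate on consensus cases.

    Over-triage: predicted level is more urgent than the consensus label.
    Under-triage: predicted level is less urgent than the consensus label.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    over = sum(p > g for g, p in zip(gold, pred))
    under = sum(p < g for g, p in zip(gold, pred))
    return {
        "accuracy": (n - over - under) / n,
        "over_triage": over / n,
        "under_triage": under / n,
    }


def level_distribution(labels):
    """Empirical distribution over the four acuity levels."""
    counts = Counter(labels)
    return [counts.get(level, 0) / len(labels) for level in range(N_LEVELS)]


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def entropy(p):
    """Shannon entropy in bits; lower values mean a more concentrated distribution."""
    return -sum(x * log2(x) for x in p if x > 0)


if __name__ == "__main__":
    # Hypothetical toy data, not drawn from the benchmark.
    print(triage_error_rates(gold=[0, 1, 2, 3], pred=[0, 2, 1, 3]))

    physicians = level_distribution([1, 1, 2, 2])    # experts split across two levels
    model = level_distribution([2, 2, 2, 2])         # repeated samples all agree
    print(total_variation(physicians, model), entropy(physicians), entropy(model))
```

Under these assumptions, a model whose repeated predictions collapse onto a single level has lower entropy than a split physician panel, which is one way to read the abstract's statement that model predictions are more concentrated than expert clinical uncertainty.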