🤖 AI Summary
Current evaluations of medical AI agents are confined to static, pre-selected imaging samples and so cannot assess how agents interact with a real radiology workflow. This work proposes ABRA, a benchmark that models the radiology workflow as an interactive environment in which agents operate an OHIF viewer and an Orthanc DICOM server through 21 function-calling tools to perform slice navigation, window-level adjustment, series selection, pixel-coordinate annotation, and structured reporting. ABRA generates its tasks programmatically across three difficulty tiers and eight task types, and scores each episode with task-type-specific automatic scorers that evaluate Planning, Execution, and Outcome separately. Experiments show that current models reach at least 89% Execution on real annotation tasks, yet their Outcome scores stay at 0–25%. On the paired oracle variant, where a simulated detector supplies the finding, Outcome on the same task rises to 69–100%, indicating that the main bottleneck is perception rather than tool orchestration.
📝 Abstract
Existing medical-agent benchmarks deliver imaging as pre-selected samples, never as an environment the agent must navigate. We introduce ABRA, a radiology-agent benchmark in which the agent operates an OHIF viewer and an Orthanc DICOM server through twenty-one function-calling tools that span slice navigation, windowing, series selection, pixel-coordinate annotation, and structured reporting. ABRA contains 655 programmatically generated tasks across three difficulty tiers and eight types (viewer control, metadata QA, vision probe, annotation, longitudinal comparison, BI-RADS reporting, and oracle variants of annotation and BI-RADS reporting), drawn from LIDC-IDRI, Duke Breast Cancer MRI, and NLST New-Lesion LongCT. Each episode is scored along Planning, Execution, and Outcome (Bluethgen et al., 2025) by task-type-specific automatic scorers. Ten current models, five closed-weight and five open-weight, reach at least 89% Execution on real annotation but only 0–25% Outcome; on the paired oracle variant where a simulated detector supplies the finding, Outcome on the same task reaches 69–100% across the models evaluated, localising the bottleneck to perception rather than tool orchestration. Code, task generators, and scorers are released at https://github.com/Luab/ABRA
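
To make the three-axis scoring and the real-versus-oracle comparison concrete, here is a minimal sketch. It is not ABRA's scorer: the `EpisodeScore` class, its field names, the task-type labels, and the toy values are all assumptions for illustration. Only the three axis names and the idea of pairing a real task with its oracle variant come from the abstract.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EpisodeScore:
    """Hypothetical per-episode record; field names are illustrative, not ABRA's schema."""
    task_type: str    # assumed labels, e.g. "annotation" or "annotation_oracle"
    planning: float   # each axis is assumed to lie in [0, 1]
    execution: float
    outcome: float


def mean_by_axis(episodes, task_type):
    """Average each of the three axes over the episodes of one task type."""
    selected = [e for e in episodes if e.task_type == task_type]
    if not selected:
        raise ValueError(f"no episodes for task type {task_type!r}")
    return {
        "planning": mean(e.planning for e in selected),
        "execution": mean(e.execution for e in selected),
        "outcome": mean(e.outcome for e in selected),
    }


def perception_gap(episodes, real="annotation", oracle="annotation_oracle"):
    """Mean Outcome gained when a simulated detector supplies the finding."""
    real_outcome = mean_by_axis(episodes, real)["outcome"]
    oracle_outcome = mean_by_axis(episodes, oracle)["outcome"]
    return oracle_outcome - real_outcome


# Toy values, made up for illustration only; they are not results from the paper.
toy = [
    EpisodeScore("annotation", 1.0, 1.0, 0.0),
    EpisodeScore("annotation", 1.0, 1.0, 0.5),
    EpisodeScore("annotation_oracle", 1.0, 1.0, 1.0),
    EpisodeScore("annotation_oracle", 1.0, 1.0, 0.5),
]
print(perception_gap(toy))
```

Keeping the axes as separate fields, rather than folding them into one number, is what lets a high Execution score sit next to a low Outcome score for the same episode. A large positive gap with Execution roughly unchanged would point to perception as the limiting step.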