MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI visual grounding benchmarks struggle to evaluate models' sequential reasoning and localization capabilities in clinical, multi-step, workflow-driven interactions. To address this gap, this work proposes MedSPOT, the first workflow-aware sequential grounding benchmark tailored to medical GUIs, formulating the task as a structured sequence of spatial decisions. The benchmark comprises 216 task videos and 597 annotated keyframes, with each task involving 2–3 interdependent steps. It introduces a strict sequential evaluation protocol and a fine-grained failure taxonomy, covering categories such as edge bias, small-target errors, and toolbar confusion, thereby extending visual grounding from isolated predictions to a workflow-aware sequential reasoning paradigm. This framework enables systematic assessment of multimodal models' procedural robustness in dynamic interfaces, better matching the high-stakes demands of clinical environments.

📝 Abstract
Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
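The strict sequential protocol described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the benchmark's actual code: it assumes each step has a ground-truth bounding box, the model predicts a click point per step, and a near/far miss is distinguished by a hypothetical distance threshold to the box center.

```python
# Minimal sketch of strict sequential grounding evaluation (illustrative only;
# function names and the near-miss threshold are assumptions, not from the paper).

def point_in_box(pt, box):
    """box = (x1, y1, x2, y2); pt = (x, y). True if the click lands in the box."""
    x, y = pt
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def center_distance(pt, box):
    """Euclidean distance from the predicted point to the box center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return ((pt[0] - cx) ** 2 + (pt[1] - cy) ** 2) ** 0.5

def evaluate_task(predictions, gt_boxes, near_thresh=50.0):
    """Strict protocol: assessment terminates at the first incorrect step.

    Returns (steps_completed, task_success, failure_type).
    """
    for i, (pt, box) in enumerate(zip(predictions, gt_boxes)):
        if pt is None:
            return i, False, "no prediction"
        if not point_in_box(pt, box):
            kind = "near miss" if center_distance(pt, box) <= near_thresh else "far miss"
            return i, False, kind
    return len(gt_boxes), True, None

# Example: a 3-step task where the model misses step 2, so step 3 is never scored.
preds = [(10, 10), (300, 300), (40, 40)]
boxes = [(0, 0, 20, 20), (0, 0, 20, 20), (30, 30, 50, 50)]
print(evaluate_task(preds, boxes))  # (1, False, 'far miss')
```

Terminating on the first error is what makes the metric workflow-aware: a model that grounds step 1 incorrectly gets no credit for steps 2 and 3, so the task success rate directly reflects error propagation.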
Problem

Research questions and friction points this paper is trying to address.

visual grounding
clinical GUI
sequential reasoning
workflow-aware
multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential visual grounding
workflow-aware benchmark
clinical GUI
multimodal LLM evaluation
error propagation analysis
Rozain Shakeel
Gaash Research Lab, National Institute of Technology Srinagar, India
Abdul Rahman Mohammad Ali
e& Group, UAE
Muneeb Mushtaq
Gaash Research Lab, National Institute of Technology Srinagar, India
Tausifa Jan Saleem
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision
Deep Learning