MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing GUI visual grounding benchmarks struggle to evaluate models' sequential reasoning and localization capabilities in clinical, multi-step, workflow-driven interactions. To address this gap, this work proposes MedSPOT, the first workflow-aware sequential grounding benchmark tailored to medical GUIs, formulating the task as a structured sequence of spatial decisions. The benchmark comprises 216 task videos and 597 annotated keyframes, with each task involving 2–3 interdependent steps. It introduces a strict sequential evaluation protocol and a fine-grained failure taxonomy, covering categories such as edge bias, small-target errors, and toolbar confusion, thereby extending visual grounding from isolated predictions to a workflow-aware sequential reasoning paradigm. This framework enables systematic assessment of multimodal models' procedural robustness in dynamic interfaces, better matching the high-stakes demands of clinical environments.

📝 Abstract
Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across interdependent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.
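The strict sequential protocol described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden reconstruction, not the benchmark's actual code: it assumes each step has a ground-truth bounding box, the model predicts a click point per step, and a near/far miss is distinguished by a hypothetical distance threshold to the box center.

```python
# Minimal sketch of strict sequential grounding evaluation (illustrative only;
# function names and the near-miss threshold are assumptions, not from the paper).

def point_in_box(pt, box):
    """box = (x1, y1, x2, y2); pt = (x, y). True if the click lands in the box."""
    x, y = pt
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def center_distance(pt, box):
    """Euclidean distance from the predicted point to the box center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return ((pt[0] - cx) ** 2 + (pt[1] - cy) ** 2) ** 0.5

def evaluate_task(predictions, gt_boxes, near_thresh=50.0):
    """Strict protocol: assessment terminates at the first incorrect step.

    Returns (steps_completed, task_success, failure_type).
    """
    for i, (pt, box) in enumerate(zip(predictions, gt_boxes)):
        if pt is None:
            return i, False, "no prediction"
        if not point_in_box(pt, box):
            kind = "near miss" if center_distance(pt, box) <= near_thresh else "far miss"
            return i, False, kind
    return len(gt_boxes), True, None

# Example: a 3-step task where the model misses step 2, so step 3 is never scored.
preds = [(10, 10), (300, 300), (40, 40)]
boxes = [(0, 0, 20, 20), (0, 0, 20, 20), (30, 30, 50, 50)]
print(evaluate_task(preds, boxes))  # (1, False, 'far miss')
```

Terminating on the first error is what makes the metric workflow-aware: a model that grounds step 1 incorrectly gets no credit for steps 2 and 3, so the task success rate directly reflects error propagation.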
Problem

Research questions and friction points this paper is trying to address.

visual grounding
clinical GUI
sequential reasoning
workflow-aware
multimodal models
Innovation

Methods, ideas, or system contributions that make the work stand out.

sequential visual grounding
workflow-aware benchmark
clinical GUI
multimodal LLM evaluation
error propagation analysis
Rozain Shakeel
Gaash Research Lab, National Institute of Technology Srinagar, India
Abdul Rahman Mohammad Ali
e& Group, UAE
Muneeb Mushtaq
Gaash Research Lab, National Institute of Technology Srinagar, India
Tausifa Jan Saleem
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE
Tajamul Ashraf
IIT Delhi, MBZUAI
Computer Vision
Deep Learning