🤖 AI Summary
Existing video evaluation benchmarks struggle to assess the ability of multimodal large language models (MLLMs) to dynamically reason about interaction-driven procedural state changes in clinical skills videos. To address this gap, this work introduces SiMing-Bench, the first benchmark specifically designed for end-to-end clinical procedure evaluation. Instantiated with SiMing-Score, a dataset of real-world examination videos annotated by physicians under double-blind conditions and aligned with standardized scoring rubrics, SiMing-Bench proposes a novel evaluation paradigm centered on “procedural correctness” through dynamic state updates. Step-level alignment analysis reveals that while prevailing MLLMs exhibit moderate correlation with expert judgments at the global procedural level, they significantly underperform in fine-grained assessment of step-wise correctness, highlighting their limited capacity to model state evolution triggered by continuous human–environment interactions.
📝 Abstract
Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded, process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
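To make the gap between procedure-level and step-level agreement concrete, here is a minimal, self-contained sketch of the kind of comparison the abstract describes: correlating model scores with expert rubric labels once over per-video totals and once per rubric step. All data here is synthetic and the Spearman helper is hand-rolled (ranks plus Pearson); this is an illustration of the metric, not the paper's actual evaluation protocol.

```python
import numpy as np

def rankdata(x):
    """Assign 1-based ranks, averaging ranks over ties."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):          # average ranks within each tie group
        mask = x == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(a, b):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])

rng = np.random.default_rng(0)
n_videos, n_steps = 30, 10

# Hypothetical physician labels: per-step correctness (0/1) for each video.
expert = rng.integers(0, 2, size=(n_videos, n_steps)).astype(float)

# Simulated model scores: follow the expert signal but with step-level noise.
model = np.clip(expert + rng.normal(0.0, 0.8, size=expert.shape), 0.0, 1.0)

# Procedure-level agreement: correlate total scores per video.
# Summing over steps averages out per-step noise, so this tends to look better.
proc_rho = spearman(expert.sum(axis=1), model.sum(axis=1))

# Step-level agreement: correlate within each rubric step, then average.
step_rhos = [spearman(expert[:, s], model[:, s]) for s in range(n_steps)]
step_rho = float(np.mean(step_rhos))

print(f"procedure-level rho: {proc_rho:.2f}")
print(f"mean step-level rho: {step_rho:.2f}")
```

The design point this sketch illustrates is exactly the abstract's caveat: an aggregate per-video correlation can look acceptable even when per-step agreement is weak, because summation washes out step-level errors.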