🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on fine-grained spatiotemporal reasoning over surgical videos, a gap that limits their clinical applicability. To bridge it, we introduce SurgCoT, the first chain-of-thought (CoT) reasoning benchmark tailored for surgical video understanding, spanning 7 surgical specialties and 35 procedures. SurgCoT uses a structured annotation protocol (Question-Option-Knowledge-Clue-Answer) to assess five reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking. The Knowledge field supplies domain background and the Clue field supplies spatiotemporal evidence; this dual guidance gives surgical reasoning evaluation a reproducible standard. Evaluation of 10 leading MLLMs shows that commercial models outperform both open-source and medical-specialized counterparts, and that significant gaps remain in surgical CoT reasoning. SurgCoT exposes these deficiencies and supports improvement of the models' progressive spatiotemporal reasoning.
📝 Abstract
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
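To make the Question-Option-Knowledge-Clue-Answer protocol concrete, the following is a minimal sketch of how one annotated item and a per-dimension accuracy score might be represented. Only the five dimension names and the five protocol fields come from the abstract; the class name `SurgCoTItem`, the function `accuracy_by_dimension`, the field types, and the placeholder values are assumptions made for illustration and may not match the schema or scoring used in the released code.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass

# Dimension names as listed in the abstract.
DIMENSIONS = (
    "Causal Action Ordering",
    "Cue-Action Alignment",
    "Affordance Mapping",
    "Micro-Transition Localization",
    "Anomaly Onset Tracking",
)


@dataclass(frozen=True)
class SurgCoTItem:
    """Hypothetical container for one annotated item (not the official schema)."""

    dimension: str             # one of DIMENSIONS
    question: str              # Question
    options: tuple[str, ...]   # Option: candidate answers
    knowledge: str             # Knowledge: background context
    clue: str                  # Clue: spatiotemporal evidence
    answer: str                # Answer: must be one of `options`

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")


def accuracy_by_dimension(
    items: list[SurgCoTItem], predictions: list[str]
) -> dict[str, float]:
    """Return the fraction of correct predictions for each dimension present."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.dimension] += 1
        correct[item.dimension] += int(predicted == item.answer)
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Placeholder content only; these are not items from the benchmark.
    demo_items = [
        SurgCoTItem(
            dimension="Causal Action Ordering",
            question="<question text>",
            options=("A", "B", "C"),
            knowledge="<background context>",
            clue="<spatiotemporal evidence>",
            answer="B",
        ),
    ]
    print(accuracy_by_dimension(demo_items, ["B"]))
```

Plain exact-match accuracy is used here only because it is the simplest multiple-choice score; the abstract does not specify the metric, and a CoT evaluation may also score the intermediate reasoning steps.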