SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

📅 2026-04-22

🤖 AI Summary
This work addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on fine-grained spatiotemporal reasoning in surgical videos, a gap that limits their clinical applicability. It introduces SurgCoT, described as the first chain-of-thought (CoT) reasoning benchmark tailored to surgical video understanding, covering seven surgical specialties and 35 procedures. SurgCoT uses a structured annotation protocol consisting of questions, options, knowledge, spatiotemporal cues, and answers, which supports assessment across five reasoning dimensions, including causal action ordering and cue-action alignment. By pairing guidance from domain knowledge with guidance from spatiotemporal cues, SurgCoT aims to provide a reproducible standard for evaluating surgical reasoning. An evaluation of ten prominent MLLMs finds that commercial models outperform both open-source and medical-specialized counterparts. The benchmark also exposes significant gaps in current models' surgical CoT reasoning and supports improvement of their progressive, step-by-step spatiotemporal reasoning.

📝 Abstract
Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: https://github.com/CVI-SZU/SurgCoT.
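
To make the Question-Option-Knowledge-Clue-Answer protocol concrete, the following is a minimal Python sketch of how one annotated item and a per-dimension accuracy score might be represented. The class name `SurgCoTItem`, its field names, and the helper `accuracy_by_dimension` are assumptions made for illustration; they are not taken from the released code, and the actual data schema and scoring in the SurgCoT repository may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

# The five reasoning dimensions named in the abstract.
DIMENSIONS = (
    "Causal Action Ordering",
    "Cue-Action Alignment",
    "Affordance Mapping",
    "Micro-Transition Localization",
    "Anomaly Onset Tracking",
)


@dataclass(frozen=True)
class SurgCoTItem:
    """Hypothetical record layout; field names are assumptions, not the released schema."""

    dimension: str            # one of DIMENSIONS
    question: str             # Question
    options: Tuple[str, ...]  # Option: candidate answers
    knowledge: str            # Knowledge: background context
    clue: str                 # Clue: spatiotemporal evidence
    answer: str               # Answer: must be one of the options

    def __post_init__(self) -> None:
        # Reject records that do not fit the protocol described in the abstract.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.answer not in self.options:
            raise ValueError("answer must be one of the options")


def accuracy_by_dimension(
    items: Sequence[SurgCoTItem], predictions: Sequence[str]
) -> Dict[str, float]:
    """Fraction of exactly matching predictions, grouped by reasoning dimension."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item, predicted in zip(items, predictions):
        total[item.dimension] += 1
        correct[item.dimension] += int(predicted == item.answer)
    return {dimension: correct[dimension] / total[dimension] for dimension in total}
```

Reporting one score per dimension, rather than a single pooled accuracy, keeps weaknesses in a specific reasoning skill visible. Exact-match scoring of the selected option is a simplifying assumption here; it does not assess the quality of the intermediate reasoning chain.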

Problem

Research questions and friction points this paper is trying to address.

spatiotemporal reasoning
surgical videos
Multi-modal Large Language Models
Chain-of-Thought
benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Reasoning
Surgical Video Understanding
Spatiotemporal Reasoning
Multimodal Large Language Models
Structured Benchmark