Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for full-stage recognition in long robotic-assisted surgical videos suffer from high computational complexity and cannot jointly model fine-grained local visual details and global temporal dynamics. Method: We propose a Hierarchical Input-dependent State Space Model (HI-SSM), featuring a novel input-driven two-level SSM architecture that jointly encodes local visual features and global surgical progression; a hybrid supervision strategy combining discrete phase labels and continuous progress signals; and efficient linear-time inference that overcomes the quadratic complexity bottleneck of Transformers. The model integrates a visual encoder with a hierarchical SSM head for end-to-end whole-video understanding. Results: On the Cholec80, MICCAI2016, and Heichole benchmarks, HI-SSM achieves absolute accuracy improvements of 2.8%, 4.3%, and 12.9%, respectively, significantly outperforming state-of-the-art approaches.

📝 Abstract
Surgical workflow analysis is essential in robot-assisted surgeries, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on transformer models; however, their quadratic attention mechanism restricts efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling property of state space models to enable decision making on full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, which appends a state space model head to a visual feature extractor to propagate temporal information. The proposed model consists of two key modules: a local-aggregation state space model block that effectively captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained with a hybrid discrete-continuous supervision strategy, in which both discrete phase labels and continuous phase-progress signals are propagated through the network. Experiments show that our method outperforms current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on Heichole). Code will be made publicly available after paper acceptance.
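The two-level design described in the abstract can be sketched in a few lines. The NumPy code below is a minimal, hypothetical illustration only (the paper's actual block design, gating functions, and weights are not specified here): an input-dependent diagonal recurrence gives the linear-time scan, a local scan inside each window plays the role of the local-aggregation block, and a second scan over per-window summaries plays the role of the global-relation block.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, W_a, W_b):
    """Linear-time scan h_t = a_t * h_{t-1} + b_t * x_t, where the
    gates a_t, b_t are computed from the input itself (the
    "input-dependent" ingredient). Cost is O(T * D), not O(T^2)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        a = 1.0 / (1.0 + np.exp(-(xt @ W_a)))  # per-channel forget gate in (0, 1)
        b = np.tanh(xt @ W_b)                  # per-channel input gate
        h = a * h + b * xt
        out[t] = h
    return out

def hierarchical_ssm(x, window, Wl, Wg):
    """Two-level sketch: a local scan within each window captures
    fine-grained dynamics; a global scan over window summaries models
    whole-video progression. Assumes T is divisible by `window`."""
    T, D = x.shape
    local = np.concatenate(
        [ssm_scan(x[s:s + window], *Wl) for s in range(0, T, window)]
    )
    summaries = local.reshape(-1, window, D).mean(axis=1)  # one token per window
    global_out = ssm_scan(summaries, *Wg)
    # broadcast global context back to every frame in its window
    return local + np.repeat(global_out, window, axis=0)

# Toy usage: 32 frames of 8-dim features, windows of 8 frames.
D, T, window = 8, 32, 8
x = rng.standard_normal((T, D))
Wl = (rng.standard_normal((D, D)), rng.standard_normal((D, D)))
Wg = (rng.standard_normal((D, D)), rng.standard_normal((D, D)))
y = hierarchical_ssm(x, window, Wl, Wg)
```

Because each scan touches every frame exactly once, doubling the video length doubles the cost, which is what makes whole-video inference feasible where quadratic attention is not.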
Problem

Research questions and friction points this paper is trying to address.

Efficient analysis of long surgical videos
Capturing local and global surgical dynamics
Improving surgical phase recognition accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical input-dependent state space model
Joint modeling of local and global dynamics
Hybrid discrete-continuous supervision strategy