🤖 AI Summary
Existing methods for full-stage recognition in long robot-assisted surgical videos suffer from high computational complexity and an inability to jointly model fine-grained local visual details and global temporal dynamics. Method: We propose a Hierarchical Input-dependent State Space Model (HI-SSM), featuring a novel input-dependent two-level SSM architecture that jointly encodes local visual features and global surgical progression; a hybrid supervision strategy combining discrete phase labels with continuous progress signals; and efficient linear-time inference that sidesteps the quadratic complexity bottleneck of Transformers. The model integrates a visual encoder with a hierarchical SSM head for end-to-end whole-video understanding. Results: On the Cholec80, MICCAI2016, and HeiChole benchmarks, HI-SSM achieves absolute accuracy improvements of 2.8%, 4.3%, and 12.9%, respectively, significantly outperforming state-of-the-art approaches.
📝 Abstract
Surgical workflow analysis is essential in robot-assisted surgery, yet the long duration of such procedures poses significant challenges for comprehensive video analysis. Recent approaches have predominantly relied on Transformer models; however, their quadratic attention mechanism precludes efficient processing of lengthy surgical videos. In this paper, we propose a novel hierarchical input-dependent state space model that leverages the linear scaling of state space models to make decisions over full-length videos while capturing both local and global dynamics. Our framework incorporates a temporally consistent visual feature extractor, built by appending a state space model head to the visual backbone so that temporal information propagates through the extracted features. The proposed model consists of two key modules: a local-aggregation state space model block that captures intricate local dynamics, and a global-relation state space model block that models temporal dependencies across the entire video. The model is trained with a hybrid discrete-continuous supervision strategy, in which both discrete phase labels and continuous phase-progress signals are propagated through the network. Experiments show that our method outperforms current state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on MICCAI2016, and +12.9% on HeiChole). Code will be made publicly available upon paper acceptance.
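To make the two-level design concrete, below is a minimal NumPy sketch of the ideas the abstract describes: a linear-time diagonal SSM scan whose parameters depend on the input, a local-aggregation pass over fixed windows, a global-relation pass over the whole sequence, and a hybrid loss mixing phase cross-entropy with progress regression. All class, function, and parameter names here (`ToyHISSM`, `ssm_scan`, window size, etc.) are illustrative assumptions for exposition, not the paper's released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssm_scan(x, a, b, c):
    """Linear-time diagonal SSM recurrence (elementwise):
    h_t = a_t * h_{t-1} + b_t * x_t,   y_t = c_t * h_t."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y[t] = c[t] * h
    return y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ToyHISSM:
    """Toy two-level input-dependent SSM head (illustrative, untrained weights)."""
    def __init__(self, dim, window, n_phases):
        self.window = window
        # Projections that make SSM parameters (a, b, c) functions of the input.
        self.Wa_l, self.Wb_l, self.Wc_l = (rng.normal(0, 0.1, (dim, dim)) for _ in range(3))
        self.Wa_g, self.Wb_g, self.Wc_g = (rng.normal(0, 0.1, (dim, dim)) for _ in range(3))
        self.W_phase = rng.normal(0, 0.1, (dim, n_phases))  # discrete phase logits
        self.w_prog = rng.normal(0, 0.1, dim)               # continuous progress head

    def _block(self, x, Wa, Wb, Wc):
        # a_t in (0, 1) acts as a per-step, input-dependent decay gate.
        return ssm_scan(x, sigmoid(x @ Wa), sigmoid(x @ Wb), x @ Wc)

    def forward(self, feats):
        T = len(feats)
        # Local-aggregation block: independent scans inside non-overlapping windows.
        local = np.concatenate([
            self._block(feats[s:s + self.window], self.Wa_l, self.Wb_l, self.Wc_l)
            for s in range(0, T, self.window)
        ])
        # Global-relation block: one scan across the entire (full-length) sequence.
        glob = self._block(local, self.Wa_g, self.Wb_g, self.Wc_g)
        return glob @ self.W_phase, sigmoid(glob @ self.w_prog)

def hybrid_loss(logits, progress, labels, progress_gt, lam=0.5):
    """Hybrid discrete-continuous supervision: phase CE + progress MSE."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -logp[np.arange(len(labels)), labels].mean()
    mse = ((progress - progress_gt) ** 2).mean()
    return ce + lam * mse
```

Both blocks reuse the same scan primitive, so total cost stays O(T) in the video length; this is the contrast with quadratic attention that the abstract highlights.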