🤖 AI Summary
Convolutional neural networks (CNNs) for surgical phase recognition suffer from heavy reliance on strong supervision and poor generalizability across domains. Method: This paper introduces the first vision-language collaborative representation framework for surgical workflow analysis, integrating the CLIP image encoder with prompt learning. It further incorporates fine-grained temporal modeling and multi-dataset transfer training to alleviate domain adaptation bottlenecks. Contribution/Results: Evaluated on three public surgical phase recognition benchmarks, the proposed method achieves average accuracy improvements of 3.2–5.8 percentage points over state-of-the-art CNN-based temporal models. It significantly enhances intraoperative video understanding and provides robust, semantically grounded representations for real-time clinical decision support, operating room resource scheduling, and surgical skill assessment.
📝 Abstract
Surgical phase recognition from video automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition have focused primarily on Transformer-based methods, although methods that extract spatial features from individual frames using a CNN and then model the resulting time series of spatial features have also shown high performance. However, there remains a paucity of research on training methods for the CNNs used for feature extraction, that is, on representation learning, in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method fine-tunes the image encoder of CLIP (Contrastive Language-Image Pre-training), a vision-language model, using prompt learning for surgical phase recognition. Experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
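To make the prompt-learning idea concrete, the following is a minimal NumPy sketch of how learnable context vectors can be combined with frozen class-name embeddings to score an image feature against surgical-phase prompts, in the style of CoOp-like prompt learning. All names, dimensions, and the phase list are hypothetical illustrations, and a simple token average stands in for CLIP's text Transformer; this is not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; real CLIP uses e.g. d = 512).
d = 32       # joint embedding dimension
n_ctx = 4    # number of learnable context tokens
phases = ["preparation", "dissection", "clipping"]  # example phase names

# Learnable context vectors shared across classes (these would be
# optimized by backprop in prompt learning; here they are just initialized).
ctx = rng.normal(scale=0.02, size=(n_ctx, d))

# Frozen class-name embeddings (stand-ins for CLIP token embeddings).
class_emb = {p: rng.normal(size=(d,)) for p in phases}

def l2norm(x):
    """Normalize along the last axis, as CLIP does before the dot product."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_feature(phase):
    # A real CLIP text encoder runs a Transformer over [ctx; class tokens];
    # a plain token average stands in for it here.
    tokens = np.vstack([ctx, class_emb[phase][None, :]])
    return tokens.mean(axis=0)

def classify(image_feat, temperature=100.0):
    # Scaled cosine similarity between the image feature and each prompt.
    text_feats = l2norm(np.stack([text_feature(p) for p in phases]))
    return temperature * l2norm(image_feat) @ text_feats.T

# A frozen "image encoder" output for one video frame (random stand-in).
frame_feat = rng.normal(size=(d,))
logits = classify(frame_feat)
pred = phases[int(np.argmax(logits))]
```

In actual prompt learning, only `ctx` would receive gradients while the image and text encoders stay frozen or are lightly fine-tuned; the classification head is just this cosine-similarity comparison against the per-phase prompt features.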