ReSW-VL: Representation Learning for Surgical Workflow Analysis Using Vision-Language Model

๐Ÿ“… 2025-05-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Convolutional neural networks (CNNs) for surgical phase recognition suffer from heavy reliance on strong supervision and poor generalizability across domains. Method: This paper introduces the first visionโ€“language collaborative representation framework for surgical workflow analysis, integrating the CLIP image encoder with prompt learning. It further incorporates fine-grained temporal modeling and multi-dataset transfer training to alleviate domain adaptation bottlenecks. Contribution/Results: Evaluated on three public surgical phase recognition benchmarks, the proposed method achieves average accuracy improvements of 3.2โ€“5.8 percentage points over state-of-the-art CNN-based temporal models. It significantly enhances intraoperative video understanding and provides robust, semantically grounded representations for real-time clinical decision support, operating room resource scheduling, and surgical skill assessment.

Technology Category

Application Category

๐Ÿ“ Abstract
Surgical phase recognition from video is a technology that automatically classifies the progress of a surgical procedure and has a wide range of potential applications, including real-time surgical support, optimization of medical resources, training and skill assessment, and safety improvement. Recent advances in surgical phase recognition technology have focused primarily on Transform-based methods, although methods that extract spatial features from individual frames using a CNN and video features from the resulting time series of spatial features using time series modeling have shown high performance. However, there remains a paucity of research on training methods for CNNs employed for feature extraction or representation learning in surgical phase recognition. In this study, we propose a method for representation learning in surgical workflow analysis using a vision-language model (ReSW-VL). Our proposed method involves fine-tuning the image encoder of a CLIP (Convolutional Language Image Model) vision-language model using prompt learning for surgical phase recognition. The experimental results on three surgical phase recognition datasets demonstrate the effectiveness of the proposed method in comparison to conventional methods.
Problem

Research questions and friction points this paper is trying to address.

Improving surgical phase recognition using vision-language models
Enhancing CNN feature extraction for surgical workflow analysis
Developing prompt learning for surgical phase classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses vision-language model for surgical workflow
Fine-tunes CLIP image encoder with prompt learning
Improves surgical phase recognition accuracy significantly
๐Ÿ”Ž Similar Papers
No similar papers found.