SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

📅 2025-06-02
🤖 AI Summary
Surgical phase recognition is limited by the scarcity of annotated surgical video. To address this, the paper proposes a semi-supervised learning framework built on a video transformer. Its core contribution is a pseudo-labeling framework that combines temporal consistency regularization on unlabeled data with contrastive learning over class prototypes, which uses both ground-truth labels and pseudo-labels to refine the feature space. By training on a small number of labeled videos together with many unlabeled ones, the method reduces reliance on labor-intensive manual annotation. Incorporating unlabeled data raises accuracy by 4.9% on the private RAMIE dataset, and on the public Cholec80 dataset the method obtains results comparable to full supervision while using only 25% of the labeled data. The authors present these results as a benchmark for semi-supervised surgical phase recognition.

📝 Abstract
Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.
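
The abstract names pseudo-labeling and temporal consistency regularization without giving their exact form. The following is a minimal sketch of one common way such components are written, not the authors' implementation. The confidence threshold of 0.9, the function names, and the choice of a mean squared difference between predictions for two temporally perturbed views of the same frames are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's phase logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(frame_logits, threshold=0.9):
    """Assign a pseudo-label to each unlabeled frame.

    Returns one (label, confidence) pair per frame. The label is None
    when the top probability is below `threshold` (an assumed value),
    so that low-confidence frames can be left out of the loss.
    """
    result = []
    for logits in frame_logits:
        probs = softmax(logits)
        confidence = max(probs)
        label = probs.index(confidence)
        result.append((label if confidence >= threshold else None, confidence))
    return result


def consistency_penalty(probs_a, probs_b):
    """Mean squared difference between two sets of per-frame distributions.

    `probs_a` and `probs_b` are assumed to be predictions for the same
    frames under two different temporal views of the clip.
    """
    total, count = 0.0, 0
    for pa, pb in zip(probs_a, probs_b):
        for a, b in zip(pa, pb):
            total += (a - b) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    logits = [[5.0, 0.0, 0.0], [0.5, 0.4, 0.3]]  # hypothetical 3-phase logits
    print(pseudo_labels(logits))
    view = [softmax(row) for row in logits]
    print(consistency_penalty(view, view))
```

In this sketch the first frame receives a pseudo-label because its prediction is confident, while the second is left unlabeled; the penalty is zero when both views agree exactly.
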
Problem

Research questions and friction points this paper is trying to address.

Improving surgical phase recognition with minimal labeled data
Leveraging unlabeled data via semi-supervised learning in surgery
Enhancing accuracy with transformer-based pseudo-labeling framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video transformer-based model with pseudo-labeling
Temporal consistency regularization for unlabeled data
Contrastive learning with class prototypes
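
Contrastive learning with class prototypes can be sketched as follows. This is an illustrative version under stated assumptions rather than the paper's formulation: prototypes are taken to be the normalized mean of the normalized features of each class, the loss is a softmax cross-entropy over cosine similarities, and the temperature of 0.1 and all function names are hypothetical. The labels passed in may be ground-truth labels or pseudo-labels.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_prototypes(features, labels, num_classes):
    """Build one prototype per class as the normalized mean feature.

    Classes with no samples get None instead of a prototype.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feature, label in zip(features, labels):
        counts[label] += 1
        for i, x in enumerate(normalize(feature)):
            sums[label][i] += x
    return [normalize(s) if c else None for s, c in zip(sums, counts)]


def prototype_contrastive_loss(feature, label, prototypes, temperature=0.1):
    """Cross-entropy over cosine similarities to the available prototypes.

    Pulls `feature` toward the prototype of `label` and away from the
    others. `label` must refer to a class that has a prototype.
    """
    f = normalize(feature)
    sims = {
        k: sum(a * b for a, b in zip(f, p)) / temperature
        for k, p in enumerate(prototypes)
        if p is not None
    }
    m = max(sims.values())
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims.values()))
    return log_denominator - sims[label]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy 2-D features
    labels = [0, 0, 1, 1]
    protos = class_prototypes(feats, labels, num_classes=2)
    print(prototype_contrastive_loss([1.0, 0.0], 0, protos))
    print(prototype_contrastive_loss([1.0, 0.0], 1, protos))
```

A feature that lies close to its own class prototype yields a small loss, while the same feature paired with the wrong class yields a large one, which is the signal that shapes the feature space.
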
Yiping Li
Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands
Ronald L. P. D. de Jong
Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands
Sahar Nasirihaghighi
Doctoral Candidate, Klagenfurt University, Austria
Deep Learning, Computer Vision, Medical Video Analysis
T. Jaspers
Department of Electrical Engineering, Video Coding & Architectures, Eindhoven University of Technology, Eindhoven, The Netherlands
Romy C. van Jaarsveld
Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands
Gino M. Kuiper
Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands
R. Hillegersberg
Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands
F. V. D. Sommen
Department of Electrical Engineering, Video Coding & Architectures, Eindhoven University of Technology, Eindhoven, The Netherlands
J. Ruurda
Department of Surgery, University Medical Center Utrecht, Utrecht, The Netherlands
M. Breeuwer
Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands
Y. A. Khalil
Department of Biomedical Engineering, Medical Image Analysis, Eindhoven University of Technology, Eindhoven, The Netherlands