🤖 AI Summary
This study addresses the critical challenge of surgical phase recognition in pituitary tumor resection videos, which is essential for intelligent surgical analysis, intraoperative decision support, and surgical training, yet hindered by scarce annotations and severe class imbalance. The authors propose a novel framework integrating self-supervised representation learning, temporal modeling, and dynamic sampling-based fine-tuning. They further introduce an interactive online platform co-designed with surgeons to enable model deployment, data feedback, and continuous optimization in a closed loop. Leveraging ResNet-50 pretrained via self-supervision, combined with focal loss and a progressive unfreezing strategy, the method achieves 90% accuracy on an independent test set, significantly outperforming existing approaches while demonstrating strong generalization capability.
📝 Abstract
Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90\% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases.
A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.