Large-scale Self-supervised Video Foundation Model for Intelligent Surgery

📅 2025-06-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current AI-based surgical video understanding methods rely predominantly on self-supervised spatial representation learning and lack explicit temporal modeling, which hinders comprehensive capture of dynamic surgical context. To address this, we propose SurgVISTA, the first large-scale self-supervised pretraining framework tailored for surgical videos, featuring (1) video-level joint spatiotemporal representation learning and (2) a surgical-expert-guided image-level knowledge distillation mechanism. SurgVISTA is pretrained via reconstruction on 3,650 surgical videos (3.55M frames) and evaluated on a novel multi-task benchmark comprising 13 datasets that span six surgical procedures and four task categories. Extensive experiments demonstrate that SurgVISTA consistently outperforms existing models pretrained in both the natural and surgical domains, achieving significant gains on clinically relevant surgical video understanding tasks.
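The summary describes the pretraining only at a high level; as a rough illustration of what reconstruction-based joint spatiotemporal pretraining of this kind typically looks like (VideoMAE-style tube masking on video clips), the following PyTorch sketch may help. All module names, sizes, and the masking ratio are illustrative assumptions, not SurgVISTA's actual implementation.

```python
# Minimal sketch of VideoMAE-style masked spatiotemporal reconstruction.
# Names, sizes, and the masking ratio are illustrative assumptions, not
# SurgVISTA's released code. Positional embeddings are omitted for brevity.
import torch
import torch.nn as nn

class MaskedVideoPretrainer(nn.Module):
    def __init__(self, dim=768, patch=16, tubelet=2):
        super().__init__()
        # 3D patch embedding turns a clip into spatiotemporal "tube" tokens
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(tubelet, patch, patch),
                                     stride=(tubelet, patch, patch))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 3 * tubelet * patch * patch)  # pixel targets

    def forward(self, video, mask):
        # video: (B, 3, T, H, W); mask: (B, N) bool, True = masked tube,
        # with the same number of masked tubes in every clip of the batch
        tokens = self.patch_embed(video).flatten(2).transpose(1, 2)  # (B, N, D)
        B, N, D = tokens.shape
        visible = tokens[~mask].reshape(B, -1, D)    # encode visible tubes only
        latent = self.encoder(visible)
        full = self.mask_token.expand(B, N, D).clone()
        full[~mask] = latent.reshape(-1, D)          # splice latents back in
        return self.head(self.decoder(full))         # (B, N, pixels per tube)

# Usage: mask ~90% of tubes per clip, reconstruct pixels at masked positions.
clip = torch.randn(2, 3, 16, 224, 224)
n_tubes = (16 // 2) * (224 // 16) ** 2               # 8 * 14 * 14 = 1568
order = torch.rand(2, n_tubes).argsort(dim=1)
mask = torch.zeros(2, n_tubes, dtype=torch.bool)
mask.scatter_(1, order[:, :int(0.9 * n_tubes)], True)
pred = MaskedVideoPretrainer()(clip, mask)
```

Training would minimize the MSE between `pred` and the (typically normalized) pixel patches at the masked positions; the high masking ratio is what forces the encoder to model temporal dynamics rather than copy content from nearby frames.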

📝 Abstract
Computer-Assisted Intervention (CAI) has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making, improving procedural efficacy, and ensuring intraoperative safety. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3,650 videos and approximately 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that captures intricate spatial structures and temporal dynamics through joint spatiotemporal modeling. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four task categories. Extensive experiments demonstrate that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, showing strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
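On the same hedged basis, the sketch below shows one plausible way to combine the reconstruction objective with the surgical-expert-guided image-level knowledge distillation the abstract describes. The `FrozenSurgicalExpert` stand-in, the cosine alignment, the pooling scheme, and the weight `lam` are all assumptions rather than the paper's published design.

```python
# Hypothetical combination of the reconstruction loss with image-level
# knowledge distillation from a frozen surgery-specific expert encoder.
# The expert architecture, cosine alignment, and weight `lam` are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenSurgicalExpert(nn.Module):
    """Stand-in for a surgery-specific image encoder guiding distillation."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        for p in self.parameters():           # teacher stays frozen
            p.requires_grad = False

    def forward(self, frames):                # (B*T, 3, H, W) -> (B*T, dim)
        return self.net(frames)

def pretrain_loss(pred, target, mask, student_tokens, clip, expert, proj, lam=0.5):
    # 1) reconstruction: MSE only on masked tube positions
    recon = F.mse_loss(pred[mask], target[mask])
    # 2) distillation: align pooled student features with expert frame features
    B, C, T, H, W = clip.shape
    with torch.no_grad():
        teacher = expert(clip.transpose(1, 2).reshape(B * T, C, H, W))
    teacher = teacher.reshape(B, T, -1).mean(dim=1)   # clip-level teacher feature
    student = proj(student_tokens.mean(dim=1))        # clip-level student feature
    distill = 1 - F.cosine_similarity(student, teacher, dim=-1).mean()
    return recon + lam * distill
```

Here `proj` would be a small linear head (e.g. `nn.Linear(768, 512)`) mapping the student's token width to the expert's feature width; in the actual method the distillation target and pooling may differ.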
Problem

Research questions and friction points this paper is trying to address.

Lack of temporal modeling in surgical AI limits dynamic context capture
Need for joint spatiotemporal learning in surgical video pre-training
Incomplete understanding of surgical scenes affects decision-making and safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint spatiotemporal representation learning from surgical videos
Reconstruction-based pre-training for spatial-temporal modeling
Surgery-specific knowledge distillation for fine-grained features
Shu Yang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Fengtao Zhou
Hong Kong University of Science and Technology
Multimodal Learning, Computational Pathology
Leon D. Mayer
Division of Intelligent Medical Systems, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany; Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
Fuxiang Huang
The Hong Kong University of Science and Technology (HKUST)
Multimodal Learning, Foundation Model for Vertical Domain, Domain Adaptation
Yiliang Chen
School of Nursing, The Hong Kong Polytechnic University, Hong Kong SAR, China
Yihui Wang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Sunan He
Hong Kong University of Science and Technology
Multi-Modal Learning
Yuxiang Nie
Hong Kong University of Science and Technology
Natural Language Processing, Multi-modal Learning, Medical Image Analysis
Xi Wang
Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Omer Sumer
Division of Intelligent Medical Systems, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany; Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
Yueming Jin
Assistant Professor, National University of Singapore
Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning
Huihui Sun
Department of Gastroenterology of Tongji Hospital, School of Medicine, Tongji University, Shanghai, China
Shuchang Xu
Hong Kong University of Science and Technology
Human-Computer Interaction, Accessibility, Machine Learning, Wearables
Alex Qinyang Liu
Prince of Wales Hospital / Chinese University of Hong Kong
Zheng Li
Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
Jing Qin
University of Southern Denmark
Mathematics, Statistics
J. Teoh
Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
Lena Maier-Hein
Division of Intelligent Medical Systems, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany; Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany; HI Helmholtz Imaging, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany; Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, Germany; National Center for Tumor Diseases (NCT), NCT Heidelberg, Heidelberg, Germany
Hao Chen
Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China; State Key Laboratory of Molecular Neuroscience, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology, Shenzhen, China