Scaling Video Pretraining for Surgical Foundation Models

πŸ“… 2026-03-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Surgical video understanding has been hindered by limited data scale, narrow procedural diversity, inconsistent evaluation protocols, and non-reproducible training pipelines. To address these challenges, this work proposes SurgRec, a scalable and reproducible self-supervised pretraining framework for surgical videos with two variants: SurgRec-MAE and SurgRec-JEPA. The study introduces the first large-scale, multi-source surgical video corpus spanning diverse procedures, paired with a balanced sampling strategy and a unified downstream evaluation benchmark, which substantially improves model generalization across tasks. Evaluated on 16 downstream datasets, SurgRec consistently outperforms existing self-supervised and vision-language methods, with especially strong performance on fine-grained temporal recognition tasks.
πŸ“ Abstract
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
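The abstract mentions a balanced sampling strategy over the multi-source corpus but does not specify it. A minimal sketch, assuming "balanced" means picking the source uniformly before picking a video within it so small domains are not drowned out by large ones (all identifiers and counts below are hypothetical, not from the paper):

```python
import random

# Hypothetical per-source video pools; the real corpus spans 10,535 videos
# across endoscopy, laparoscopy, cataract, and robotic surgery.
corpus = {
    "endoscopy":   [f"endo_{i:04d}" for i in range(5000)],
    "laparoscopy": [f"lap_{i:04d}" for i in range(3000)],
    "cataract":    [f"cat_{i:04d}" for i in range(1500)],
    "robotic":     [f"rob_{i:04d}" for i in range(1035)],
}

def balanced_sample(corpus, n, rng=random):
    """Draw n videos, choosing the source uniformly at random first,
    then a video within that source (illustrative scheme only)."""
    sources = list(corpus)
    return [rng.choice(corpus[rng.choice(sources)]) for _ in range(n)]

batch = balanced_sample(corpus, 8, random.Random(0))
```

With uniform source selection, each domain contributes roughly a quarter of pretraining batches in expectation regardless of its raw size; inverse-frequency weighting would be an alternative the paper may use instead.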
Problem

Research questions and friction points this paper is trying to address.

surgical video understanding
data scale
procedural diversity
evaluation consistency
reproducible training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical video understanding
scalable pretraining
reproducible benchmark
multi-source corpus
foundation model
Sicheng Lu
The Johns Hopkins University, United States
Zikai Xiao
Zhejiang University, China
Jianhui Wei
Zhejiang University-University of Illinois Urbana-Champaign Institute, China
Danyu Sun
Zhejiang University / University of Illinois Urbana-Champaign
Qi Lu
Zhejiang Lab, China
Keli Hu
Shaoxing University, China
Yang Feng
Senior Scientist at Angelalign
medical image processing, AI in orthodontics, clear aligner CAD/CAM, mesh & point cloud algorithms, deep learning
Jian Wu
Zhejiang University, China
Zongxin Yang
Harvard Medical School, United States
Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI