Scaling Video Pretraining for Surgical Foundation Models

πŸ“… 2026-03-31
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Surgical video understanding has been hindered by limited data scale, narrow procedural diversity, inconsistent evaluation protocols, and non-reproducible training pipelines. To address these challenges, this work proposes SurgRec, a scalable and reproducible self-supervised pretraining framework for surgical videos with two variants: SurgRec-MAE and SurgRec-JEPA. The study introduces the first large-scale, multi-source surgical video corpus spanning diverse procedures, paired with a balanced sampling strategy and a unified downstream evaluation benchmark, which substantially improves model generalization across tasks. Evaluated on 16 downstream datasets, SurgRec consistently outperforms existing self-supervised and vision-language methods, with especially strong performance on fine-grained temporal recognition tasks.
πŸ“ Abstract
Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
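The abstract mentions a balanced sampling strategy over the multi-source corpus but does not specify it. A minimal sketch, assuming "balanced" means picking the source uniformly before picking a video within it so small domains are not drowned out by large ones (all identifiers and counts below are hypothetical, not from the paper):

```python
import random

# Hypothetical per-source video pools; the real corpus spans 10,535 videos
# across endoscopy, laparoscopy, cataract, and robotic surgery.
corpus = {
    "endoscopy":   [f"endo_{i:04d}" for i in range(5000)],
    "laparoscopy": [f"lap_{i:04d}" for i in range(3000)],
    "cataract":    [f"cat_{i:04d}" for i in range(1500)],
    "robotic":     [f"rob_{i:04d}" for i in range(1035)],
}

def balanced_sample(corpus, n, rng=random):
    """Draw n videos, choosing the source uniformly at random first,
    then a video within that source (illustrative scheme only)."""
    sources = list(corpus)
    return [rng.choice(corpus[rng.choice(sources)]) for _ in range(n)]

batch = balanced_sample(corpus, 8, random.Random(0))
```

With uniform source selection, each domain contributes roughly a quarter of pretraining batches in expectation regardless of its raw size; inverse-frequency weighting would be an alternative the paper may use instead.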
Problem

Research questions and friction points this paper is trying to address.

surgical video understanding
data scale
procedural diversity
evaluation consistency
reproducible training pipeline
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical video understanding
scalable pretraining
reproducible benchmark
multi-source corpus
foundation model
Sicheng Lu
The Johns Hopkins University, United States
Zikai Xiao
Zhejiang University, China
Jianhui Wei
Zhejiang University-University of Illinois Urbana-Champaign Institute, China
Danyu Sun
Zhejiang University / University of Illinois Urbana-Champaign
Qi Lu
Zhejiang Lab, China
Keli Hu
Shaoxing University, China
Yang Feng
Senior Scientist at Angelalign
medical image processing, AI in orthodontics, clear aligner CAD/CAM, mesh & point cloud algorithms, deep learning
Jian Wu
Zhejiang University, China
Zongxin Yang
Harvard Medical School, United States
Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI