UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing foundation models for surgical videos, which focus excessively on low-level visual artifacts such as smoke and specular reflections while failing to capture high-level semantic structure. To overcome this, the authors propose a video-native foundation model tailored to surgical videos, shifting the learning objective from pixel-level reconstruction to latent motion prediction. Built upon the V-JEPA architecture, the method introduces three key innovations: motion-guided latent prediction, spatiotemporal affinity-based self-distillation, and feature diversity regularization. The model is pretrained on UniSurg-15M, a large-scale surgical video dataset, and demonstrates state-of-the-art performance across 17 benchmark tasks, including surgical phase recognition, action triplet understanding, skill assessment, polyp segmentation, and depth estimation.
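Of the three innovations listed above, spatiotemporal affinity-based self-distillation is the most abstract; one plausible form is sketched below, where each token's cosine-similarity distribution over all other spatiotemporal tokens is matched between a student and a teacher via KL divergence. The function name, the temperature `tau`, and the KL formulation are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def affinity_distillation_loss(student_tokens, teacher_tokens, tau=0.1):
    """Hedged sketch of affinity self-distillation: enforce that student
    and teacher agree on the *relations* between spatiotemporal tokens,
    not on the token features themselves.

    student_tokens, teacher_tokens: (B, N, D) patch-token features.
    """
    # L2-normalize so dot products become cosine similarities
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    # (B, N, N) pairwise affinity logits, sharpened by temperature tau
    s_aff = s @ s.transpose(1, 2) / tau
    t_aff = t @ t.transpose(1, 2) / tau
    # Match each row's affinity distribution (teacher as soft target)
    return F.kl_div(F.log_softmax(s_aff, dim=-1),
                    F.softmax(t_aff, dim=-1),
                    reduction="batchmean")
```

In a self-distillation setup the teacher would typically be an exponential-moving-average copy of the student, queried under `torch.no_grad()`.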

📝 Abstract
While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details (such as smoke, specular reflections, and fluid motion) rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
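The abstract's central shift, predicting masked tokens in latent space rather than reconstructing pixels, can be illustrated with a toy sketch of a JEPA-style objective: a context encoder sees unmasked tokens, a predictor guesses the latent features that an EMA target encoder produces for the masked positions, and the per-token loss can optionally be up-weighted by a motion score (a stand-in for the paper's motion guidance). The class name, the linear "encoders", and the `motion` weighting are all illustrative assumptions, not UniSurg's architecture.

```python
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """Toy V-JEPA-style module: predict an EMA target encoder's latent
    features for masked video tokens, instead of reconstructing pixels."""

    def __init__(self, dim=64):
        super().__init__()
        self.context_encoder = nn.Linear(dim, dim)
        self.target_encoder = nn.Linear(dim, dim)   # EMA copy, frozen
        self.predictor = nn.Linear(dim, dim)
        for p in self.target_encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def update_target(self, momentum=0.99):
        # Exponential-moving-average update of the target encoder
        for t, c in zip(self.target_encoder.parameters(),
                        self.context_encoder.parameters()):
            t.mul_(momentum).add_(c, alpha=1 - momentum)

    def loss(self, tokens, mask, motion=None):
        """tokens: (B, N, D) patch tokens; mask: (B, N) bool, True = predict;
        motion: optional (B, N) per-token motion weights (assumption)."""
        # Encode only the visible context (masked tokens zeroed out)
        ctx = self.context_encoder(tokens * (~mask).unsqueeze(-1))
        pred = self.predictor(ctx)
        with torch.no_grad():
            tgt = self.target_encoder(tokens)       # latent targets
        err = (pred - tgt).abs().mean(dim=-1)       # (B, N) per-token L1
        if motion is not None:
            err = err * motion                      # motion-guided weighting
        return err[mask].mean()                     # score masked positions only
```

A training step would compute `loss`, backpropagate into the context encoder and predictor, then call `update_target()`; the stop-gradient on the target branch is what keeps the latent targets stable.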
Problem

Research questions and friction points this paper is trying to address.

surgical video understanding
foundation model
pixel-level reconstruction
semantic structure
motion prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent motion prediction
motion-guided prediction
spatiotemporal affinity self-distillation
feature diversity regularization
video-native foundation model
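Among the contributions listed above, feature diversity regularization guards against representation collapse in texture-sparse scenes. The paper's exact regularizer is not reproduced here; a common form this could take is a variance hinge in the style of VICReg, which penalizes feature dimensions whose batch standard deviation falls below a target. The function name and `target_std` are illustrative assumptions.

```python
import torch

def diversity_regularizer(z, eps=1e-4, target_std=1.0):
    """Variance-hinge sketch of a feature diversity term.

    z: (B, D) pooled features for a batch. Dimensions whose batch std
    drops below target_std are penalized, discouraging the collapse
    where all tokens map to (nearly) the same vector.
    """
    std = torch.sqrt(z.var(dim=0) + eps)            # per-dimension batch std
    return torch.relu(target_std - std).mean()      # hinge below the target
```

Fully collapsed features (identical rows) incur a penalty near `target_std`, while well-spread features incur almost none, so adding this term to the prediction loss keeps the latent space from degenerating.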
Jinlin Wu
Institute of Automation, Chinese Academy of Sciences
Felix Holm
Technische Universität München
Medical AI · Surgical Data Science
Chuxi Chen
Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
An Wang
The Chinese University of Hong Kong
Medical Image Analysis · Surgical Scene Perception · Multimodal AI
Yaxin Hu
Computer Science, University of Wisconsin - Madison
Human-Robot Interaction · Accessibility · HCI · Conversational Agents
Xiaofan Ye
Neuromedical Centre, Hong Kong University Shenzhen Hospital, Shenzhen, China
Zelin Zang
Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences
Deep Learning
Miao Xu
Institute of Automation, Chinese Academy of Sciences
3D Face · 3D Body
Lihua Zhou
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS
Machine Learning · Transfer Learning
Huai Liao
Department of Respiratory Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
D. T. Chan
Department of Surgery, The Chinese University of Hong Kong, Hong Kong, China
Ming Feng
Department of Neurosurgery, China Pituitary Disease Registry Center, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
W. Poon
Neuromedical Centre, Hong Kong University Shenzhen Hospital, Shenzhen, China
Hongliang Ren
Chinese University of Hong Kong | National University of Singapore | JHU/Harvard (RF) | CUHK (PhD)
Biorobotics & Intelligent Systems · Medical Mechatronics · Continuum/Soft Flexible Robots and Sensors · Multisensory Perception
Dong Yi
Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences
Computer Vision · Pattern Recognition
Nassir Navab
Professor of Computer Science, Technische Universität München
Gaofeng Meng
Center for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
Jiebo Luo
Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong, China
Hongbin Liu
Chinese Academy of Sciences; King's College London
AI and Medical Robotics · Embodied AI · MLLM
Zhen Lei
Associate Professor, OSCO Research Chair in Off-site Construction
Offsite Construction · Construction Engineering and Management