SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis

📅 2025-06-09
🤖 AI Summary
Current foundation models (FMs) for surgical video analysis are hindered by the lack of large-scale, diverse, and standardized pretraining and evaluation resources. To address this, we introduce SurgBench—the first unified surgical video analysis benchmark—comprising (1) SurgBench-P, a pretraining dataset of 53 million frames spanning 22 surgical procedures, and (2) SurgBench-E, an evaluation benchmark with 72 fine-grained tasks across six analytical dimensions. Our key contribution is the first standardized framework enabling cross-procedure and cross-modal generalization assessment, integrating multi-source real-world surgical videos, granular task taxonomy, and a unified evaluation protocol. Pretraining on SurgBench-P significantly improves the performance of mainstream video FMs on surgical tasks, particularly enhancing zero-shot and few-shot transfer capabilities to unseen procedures and modalities. SurgBench thus fills a critical gap in large-scale, standardized benchmarking for surgical video understanding.

📝 Abstract
Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of large-scale surgical video datasets for pretraining
Providing unified benchmarking for diverse surgical video analysis tasks
Improving generalization of video foundation models in surgery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale surgical video benchmarking framework
Diverse pretraining dataset with 53M frames
Robust evaluation across 72 fine-grained tasks
Jianhui Wei
Zhejiang University, Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Zikai Xiao
Zhejiang University, Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Danyu Sun
Zhejiang University / University of Illinois Urbana-Champaign
Luqi Gong
Zhejiang Lab
Zongxin Yang
Harvard University
Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI
Jian Wu
Zhejiang University, Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence