SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis

📅 2025-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current foundation models (FMs) for surgical video analysis are hindered by the lack of large-scale, diverse, and standardized pretraining and evaluation resources. To address this, we introduce SurgBench, the first unified surgical video analysis benchmark, comprising (1) SurgBench-P, a pretraining dataset of 53 million frames spanning 22 surgical procedures, and (2) SurgBench-E, an evaluation benchmark with 72 fine-grained tasks across six analytical dimensions. Our key contribution is the first standardized framework for assessing cross-procedure and cross-modal generalization, integrating multi-source real-world surgical videos, a granular task taxonomy, and a unified evaluation protocol. Pretraining on SurgBench-P significantly improves the performance of mainstream video FMs on surgical tasks, particularly enhancing zero-shot and few-shot transfer to unseen procedures and modalities. SurgBench thus fills a critical gap in large-scale, standardized benchmarking for surgical video understanding.

📝 Abstract

Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.
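As a rough illustration of how results over many fine-grained tasks grouped into six categories might be summarized, the sketch below macro-averages per-task scores within each category and then across categories. The abstract does not describe the paper's aggregation protocol, so this is an assumption. The function name `summarize`, the tuple layout, and the use of a single scalar score per task are hypothetical and do not come from the SurgBench code; only the six category names are taken from the abstract.

```python
from collections import defaultdict
from statistics import mean

# The six evaluation categories named in the abstract.
CATEGORIES = (
    "phase classification",
    "camera motion",
    "tool recognition",
    "disease diagnosis",
    "action classification",
    "organ detection",
)


def summarize(task_scores):
    """Macro-average scores per category, then across categories.

    task_scores: iterable of (category, task_name, score) tuples.
    This layout is hypothetical, not the benchmark's actual format.
    Raises statistics.StatisticsError if no scores are given.
    """
    by_category = defaultdict(list)
    for category, _task_name, score in task_scores:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        by_category[category].append(score)

    # Each category counts equally, regardless of how many tasks it holds.
    per_category = {c: mean(scores) for c, scores in by_category.items()}
    overall = mean(list(per_category.values()))
    return per_category, overall
```

Averaging within categories first keeps a category with many tasks from dominating the overall number. Whether that weighting suits a given comparison is a design choice rather than something the source specifies.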

Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of large-scale surgical video datasets for pretraining

Providing unified benchmarking for diverse surgical video analysis tasks

Improving generalization of video foundation models in surgery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale surgical video benchmarking framework

Diverse pretraining dataset with 53M frames

Robust evaluation across 72 fine-grained tasks
