🤖 AI Summary
Existing general-purpose vision-language models (VLMs) lack surgical-domain supervision and high-quality multimodal data, limiting their capability in surgical visual perception, temporal modeling, and high-level reasoning. To address this, we introduce SurgVLM—the first foundation model tailored for surgery—and its companion benchmark, SurgVLM-Bench. Our contributions are threefold: (1) the release of SurgVLM-DB, a large-scale surgical multimodal dataset comprising 1.81M video frames and 7.79M dialogues across 16 procedures and 18 anatomical structures; (2) a hierarchical vision-language alignment paradigm; and (3) SurgVLM-Bench, the first standardized multimodal evaluation benchmark for surgery. Built upon Qwen2.5-VL, SurgVLM integrates 23 public datasets under a unified annotation schema and undergoes instruction-tuning on 10+ surgical tasks. The SurgVLM-7B/32B/72B variants comprehensively outperform 14 state-of-the-art VLMs—including GPT-4o—on SurgVLM-Bench, achieving significant gains in surgical visual question answering, procedural step localization, and anomaly detection.
📝 Abstract
Foundation models have achieved transformative success across biomedical domains by enabling holistic understanding of multimodal data. However, their application in surgery remains underexplored. Surgical intelligence presents unique challenges, requiring surgical visual perception, temporal analysis, and reasoning. Existing general-purpose vision-language models fail to address these needs due to insufficient domain-specific supervision and the lack of a large-scale, high-quality surgical database. To bridge this gap, we propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence, a single universal model that can tackle versatile surgical tasks. To enable this, we construct a large-scale multimodal surgical database, SurgVLM-DB, comprising over 1.81 million frames with 7.79 million conversations, spanning more than 16 surgical types and 18 anatomical structures. We unify and reorganize 23 public datasets across 10 surgical tasks, standardize their labels, and perform hierarchical vision-language alignment to facilitate comprehensive coverage of progressively finer-grained surgical tasks, from visual perception and temporal analysis to high-level reasoning. Building upon this dataset, we develop SurgVLM on top of Qwen2.5-VL and instruction-tune it on 10+ surgical tasks. We further construct a surgical multimodal benchmark, SurgVLM-Bench, for method evaluation. SurgVLM-Bench consists of 6 widely used datasets in the surgical domain, covering several crucial downstream tasks. On SurgVLM-Bench, we evaluate three SurgVLM variants (SurgVLM-7B, SurgVLM-32B, and SurgVLM-72B) and conduct comprehensive comparisons with 14 mainstream commercial VLMs (e.g., GPT-4o, Gemini 2.0 Flash, Qwen2.5-Max).
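A unified annotation schema of the kind described above would typically normalize every source dataset into task-tagged image–dialogue pairs suitable for instruction tuning. The sketch below shows one plausible shape for such a sample; all field names and the example frame path are hypothetical illustrations, not the actual SurgVLM-DB schema.

```python
# Illustrative sketch of a single instruction-tuning sample in a unified
# surgical multimodal dataset. Field names ("image", "task",
# "conversations") are assumptions for illustration only.

def make_sample(frame_path: str, task: str, question: str, answer: str) -> dict:
    """Bundle one extracted video frame with a task-tagged dialogue turn."""
    return {
        "image": frame_path,   # path to an extracted video frame
        "task": task,          # e.g. "phase_recognition", "instrument_recognition"
        "conversations": [
            # "<image>" is a common placeholder token marking where the
            # visual input is injected into the text prompt.
            {"role": "user", "content": f"<image>\n{question}"},
            {"role": "assistant", "content": answer},
        ],
    }

# Hypothetical example for a surgical phase-recognition task.
sample = make_sample(
    "frames/video01/000123.jpg",
    "phase_recognition",
    "Which surgical phase is shown in this frame?",
    "Calot triangle dissection",
)
print(sample["task"])  # phase_recognition
```

Keeping the task tag explicit in each sample makes it straightforward to mix 10+ heterogeneous tasks in one training stream and to filter samples per task when building an evaluation split.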