SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
The absence of specialized video large language models (Vid-LLMs) tailored for fine-grained surgical video understanding hinders comprehensive multimodal analysis in surgical AI. Method: This work introduces the first Vid-LLM framework designed explicitly for multi-granularity surgical understanding. It comprises: (1) SVU-31K, a large-scale surgical video understanding dataset; (2) StageFocus, a two-stage architecture enabling progressive modeling—from global procedural flow to local operative actions; and (3) a multi-frequency fusion attention mechanism that jointly encodes low-frequency semantic tokens and high-frequency detailed visual tokens. Contribution/Results: Extensive experiments demonstrate consistent and significant improvements over state-of-the-art methods across both coarse- and fine-grained surgical understanding tasks. Notably, the model excels in complex surgical context awareness and precise key-action localization, exhibiting superior semantic parsing capability grounded in spatiotemporal and hierarchical visual-language alignment.
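The paper does not give implementation details for the multi-frequency fusion attention, but the description — jointly encoding low-frequency semantic tokens and high-frequency detailed visual tokens — suggests a cross-attention fusion. A minimal single-head sketch of that general idea (the function name, weight initialization, and residual design are all hypothetical, not taken from the paper) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_attention(low_tokens, high_tokens, rng=None):
    """Hypothetical sketch of multi-frequency fusion attention.

    low_tokens:  (n_low, d)  low-frequency semantic visual tokens
    high_tokens: (n_high, d) high-frequency detailed visual tokens

    Low-frequency tokens act as queries over the high-frequency
    tokens, and the attended detail is added back residually so the
    semantic stream retains fine-grained information. Projection
    weights are random placeholders standing in for learned ones.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = low_tokens.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = low_tokens @ Wq          # queries from the semantic stream
    k = high_tokens @ Wk         # keys from the detail stream
    v = high_tokens @ Wv         # values from the detail stream
    attn = softmax(q @ k.T / np.sqrt(d))  # (n_low, n_high)
    return low_tokens + attn @ v          # residual fusion, (n_low, d)
```

This is only one plausible reading of the mechanism; the published model may fuse the two token streams differently (e.g., with multi-head attention or gated mixing).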

📝 Abstract
Recent advances in Multimodal Large Language Models have demonstrated great potential in the medical domain, helping users understand surgical scenes and procedures. Beyond image-based methods, Video Large Language Models (Vid-LLMs) have emerged as a promising avenue for capturing the complex sequences of information involved in surgery. However, there is still a lack of Vid-LLMs specialized for fine-grained surgical video understanding, which is crucial for analyzing specific processes or details within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct the SVU-31K dataset, which consists of over 31K video-instruction pairs and enables both holistic understanding and detailed analysis of surgical procedures. Furthermore, we introduce the StageFocus mechanism, a two-stage framework that performs multi-grained, progressive understanding of surgical videos. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the retention of critical information. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing complex procedural contexts.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized Vid-LLMs for surgical video understanding
Need for fine-grained analysis of surgical procedures
Challenges in capturing complex surgical sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

First Vid-LLM designed for both full and fine-grained surgical video comprehension
SVU-31K dataset enables detailed surgical analysis
StageFocus and Multi-frequency Fusion Attention mechanisms