🤖 AI Summary
Existing approaches to surgical scene understanding typically treat procedural phase recognition, semantic reasoning, and visual localization in isolation, leading to fragmented representations and semantic inconsistencies. This work proposes SurgMLLM, the first framework to unify surgical scene understanding through a multimodal large language model. It jointly models surgical phases, instrument–verb–target (IVT) triplets, and pixel-level segmentation end-to-end, so that high-level reasoning and low-level localization reinforce each other. The method combines structured semantic reasoning, a temporal aggregation prompting mechanism, and pixel-level mask supervision, and introduces a new dataset, CholecT45-Scene, to support joint evaluation of all three tasks. Experiments show that SurgMLLM raises IVT triplet recognition average precision from 40.7% to 46.0% on CholecT45-Scene and consistently outperforms existing methods on phase recognition and segmentation.
📝 Abstract
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to perform structured, interpretable reasoning that jointly models phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with the existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.
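To make the pipeline described in the abstract more concrete, the following is a minimal sketch of two of its pieces: temporal aggregation of per-frame segmentation tokens and a unified objective that adds a mask loss to the language loss. The abstract does not specify either mechanism, so everything here is an assumption for illustration: mean pooling as the aggregation, binary cross-entropy plus Dice as the grounding loss, and the hypothetical names `aggregate_tokens`, `unified_loss`, and `seg_weight`.

```python
import math
from typing import List, Sequence


def aggregate_tokens(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool per-frame segmentation token embeddings over time.

    Assumption: the paper's temporal aggregation may be more elaborate;
    mean pooling is only the simplest stand-in.
    """
    if not tokens:
        raise ValueError("need at least one frame")
    dim = len(tokens[0])
    return [sum(frame[d] for frame in tokens) / len(tokens) for d in range(dim)]


def bce_loss(probs: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over flattened mask pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[float], eps: float = 1e-7) -> float:
    """Soft Dice loss over flattened mask pixels."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def unified_loss(text_loss: float, probs: Sequence[float],
                 targets: Sequence[float], seg_weight: float = 1.0) -> float:
    """Language loss plus weighted grounding loss (assumed form: BCE + Dice)."""
    grounding = bce_loss(probs, targets) + dice_loss(probs, targets)
    return text_loss + seg_weight * grounding


# Two frames, each with a 2-dimensional segmentation token (toy values).
prompt = aggregate_tokens([[1.0, 2.0], [3.0, 4.0]])
# A perfect mask prediction leaves essentially only the language loss.
loss = unified_loss(text_loss=0.5, probs=[1.0, 0.0], targets=[1.0, 0.0])
```

In this reading, the pooled `prompt` vector would be passed to the segmentation network, and a single scalar `loss` would drive both the language model and the mask decoder, which is what lets the reasoning and grounding tasks shape each other during training.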