SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surgical datasets suffer from inconsistent taxonomy definitions and lack pixel-level segmentation annotations, hindering consistent model evaluation and cross-domain generalization. To address this, we introduce SurgMLLMBench—the first multimodal large language model (MLLM) benchmark tailored for surgical scene understanding, covering laparoscopic, robot-assisted, and microsurgical scenarios. Its key contributions are: (1) a unified semantic taxonomy for surgical instruments; (2) integrated pixel-level segmentation masks with structured visual question-answering annotations; and (3) the first systematic application of MLLMs to cross-surgical-type scene understanding. Extensive experiments demonstrate that a single MLLM achieves strong zero-shot generalization on unseen surgical datasets. SurgMLLMBench thus establishes a reproducible, interactive, and domain-consistent evaluation benchmark for surgical AI research.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive MLLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and microsurgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
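To make the annotation design concrete, below is a minimal sketch of what a single record pairing a pixel-level instrument mask with a structured VQA entry under a shared taxonomy might look like. All field names, label values, and the mask encoding are illustrative assumptions, not the released SurgMLLMBench schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a benchmark like SurgMLLMBench: one frame
# carries both pixel-level instrument masks and structured VQA annotations,
# with instrument labels drawn from a single unified taxonomy.
# Every field name and label value here is an illustrative assumption.

@dataclass
class InstrumentMask:
    instrument_class: str  # label from the unified taxonomy, e.g. "needle_driver"
    rle_mask: str          # run-length-encoded binary mask for this frame

@dataclass
class VQAAnnotation:
    question: str          # e.g. "Which instrument is grasping the vessel?"
    answer: str            # free-text or taxonomy-constrained answer
    grounded_classes: list[str] = field(default_factory=list)  # taxonomy labels the answer refers to

@dataclass
class SurgicalFrameRecord:
    frame_id: str
    domain: str            # "laparoscopic" | "robot_assisted" | "microsurgical"
    masks: list[InstrumentMask] = field(default_factory=list)
    vqa: list[VQAAnnotation] = field(default_factory=list)

# Example: a microsurgical frame from an artificial vascular anastomosis clip.
record = SurgicalFrameRecord(
    frame_id="mavis_0001_000123",
    domain="microsurgical",
    masks=[InstrumentMask("micro_forceps", rle_mask="...")],
    vqa=[VQAAnnotation(
        question="Which instrument is holding the vessel wall?",
        answer="micro forceps",
        grounded_classes=["micro_forceps"],
    )],
)
```

Keeping masks and VQA entries on the same record, keyed to one taxonomy, is what would let a single evaluation cover both segmentation and conversational tasks.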
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of a unified multimodal benchmark for surgical AI evaluation
Integrating pixel-level segmentation with structured VQA annotations across surgical domains
Enabling comprehensive evaluation beyond traditional visual question answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal benchmark for surgical scene understanding
Integrates pixel-level segmentation and structured VQA annotations
A single model achieves cross-domain generalization on the benchmark (see the evaluation sketch below)
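As a rough illustration of how such cross-domain evaluation could be run, the following sketch scores one model per surgical domain, reusing the record layout sketched above. The `load_records` loader, `model.answer` interface, and exact-match metric are hypothetical stand-ins, not the paper's released evaluation code.

```python
from collections import defaultdict

# Hypothetical cross-domain evaluation loop: one model is scored on each
# surgical domain, including a held-out domain to probe zero-shot
# generalization. `model.answer` and `load_records` are assumed interfaces,
# and exact match is a toy stand-in for the paper's actual VQA metrics.

DOMAINS = ["laparoscopic", "robot_assisted", "microsurgical"]

def exact_match(pred: str, gold: str) -> float:
    """Toy metric: case-insensitive exact string match."""
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate(model, load_records, domains=DOMAINS):
    """Return per-domain VQA accuracy for a single model."""
    scores = defaultdict(list)
    for domain in domains:
        for record in load_records(domain):  # records as in the sketch above
            for qa in record.vqa:
                pred = model.answer(record.frame_id, qa.question)
                scores[domain].append(exact_match(pred, qa.answer))
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}

# Usage idea: train on two domains, then compare the evaluate() score on the
# held-out third domain against domain-specific baselines.
```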
Tae-Min Choi
Samsung Research
Tae Kyeong Jeong
Center for Humanoid Research, Korea Institute of Science and Technology
Garam Kim
Center for Humanoid Research, Korea Institute of Science and Technology
Jaemin Lee
Department of Plastic Surgery, College of Medicine, Korea University
Yeongyoon Koh
Department of Orthopedic Surgery, College of Medicine, Korea University
In Cheul Choi
Department of Orthopedic Surgery, College of Medicine, Korea University
Jae-Ho Chung
Department of Plastic Surgery, College of Medicine, Korea University
Jong Woong Park
Department of Orthopedic Surgery, College of Medicine, Korea University
Juyoun Park
Senior Researcher at Korea Institute of Science and Technology (KIST)
Machine Learning · Artificial Intelligence · Robotics