🤖 AI Summary
Existing surgical datasets suffer from inconsistent taxonomy definitions and lack pixel-level segmentation annotations, hindering consistent model evaluation and cross-domain generalization. To address this, we introduce SurgMLLMBench, the first multimodal large language model (MLLM) benchmark tailored for surgical scene understanding, covering laparoscopic, robot-assisted, and micro-surgical scenarios. Its key contributions are: (1) a unified semantic taxonomy for surgical instruments; (2) pixel-level segmentation masks integrated with structured Visual Question Answering (VQA) annotations; and (3) the first systematic application of MLLMs to scene understanding across surgical types. Extensive experiments demonstrate that a single MLLM achieves strong zero-shot generalization to unseen surgical datasets. SurgMLLMBench thus establishes a reproducible, interactive, and domain-consistent evaluation benchmark for surgical AI research.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive MLLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and supporting richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and the development of interactive surgical reasoning models.
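To make the described dataset structure concrete, below is a minimal sketch of what a unified annotation record combining a pixel-level instrument mask with structured VQA pairs might look like. All field names (`frame_id`, `domain`, `taxonomy_label`, `mask_rle`, `qa_pairs`) and the run-length mask encoding are illustrative assumptions, not the released SurgMLLMBench schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a SurgMLLMBench-style annotation record.
# Field names and encodings are assumptions for illustration only.

@dataclass
class VQAPair:
    question: str  # structured VQA question about the frame
    answer: str    # ground-truth answer

@dataclass
class InstrumentAnnotation:
    taxonomy_label: str  # class from the unified instrument taxonomy
    mask_rle: str        # pixel-level segmentation mask (placeholder RLE)

@dataclass
class FrameRecord:
    frame_id: str
    domain: str  # "laparoscopic" | "robot-assisted" | "micro-surgical"
    instruments: list[InstrumentAnnotation] = field(default_factory=list)
    qa_pairs: list[VQAPair] = field(default_factory=list)

# Example: one frame carrying both a segmentation mask and a VQA annotation.
record = FrameRecord(
    frame_id="mavis_0001_000042",  # hypothetical identifier
    domain="micro-surgical",
    instruments=[InstrumentAnnotation("needle driver", mask_rle="12 4 30 6")],
    qa_pairs=[VQAPair("Which instrument is grasping the vessel?", "needle driver")],
)
print(record.domain, record.instruments[0].taxonomy_label)
```

Pairing masks and VQA annotations in one record per frame is what would let a single evaluation harness score both segmentation and conversational tasks under the same taxonomy, as the abstract describes.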