SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing surgical datasets suffer from inconsistent taxonomy definitions and lack pixel-level segmentation annotations, hindering consistent model evaluation and cross-domain generalization. To address this, we introduce SurgMLLMBench—the first multimodal large language model (MLLM) benchmark tailored for surgical scene understanding, covering laparoscopic, robot-assisted, and microsurgical scenarios. Its key contributions are: (1) a unified semantic taxonomy for surgical instruments; (2) integrated pixel-level segmentation masks with structured visual question-answering annotations; and (3) the first systematic application of MLLMs to cross-surgical-type scene understanding. Extensive experiments demonstrate that a single MLLM achieves strong zero-shot generalization on unseen surgical datasets. SurgMLLMBench thus establishes a reproducible, interactive, and domain-consistent evaluation benchmark for surgical AI research.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive MLLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and microsurgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.
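To make the annotation design concrete, below is a minimal sketch of what a single record pairing a pixel-level instrument mask with a structured VQA entry under a shared taxonomy might look like. All field names, label values, and the mask encoding are illustrative assumptions, not the released SurgMLLMBench schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a benchmark like SurgMLLMBench: one frame
# carries both pixel-level instrument masks and structured VQA annotations,
# with instrument labels drawn from a single unified taxonomy.
# Every field name and label value here is an illustrative assumption.

@dataclass
class InstrumentMask:
    instrument_class: str  # label from the unified taxonomy, e.g. "needle_driver"
    rle_mask: str          # run-length-encoded binary mask for this frame

@dataclass
class VQAAnnotation:
    question: str          # e.g. "Which instrument is grasping the vessel?"
    answer: str            # free-text or taxonomy-constrained answer
    grounded_classes: list[str] = field(default_factory=list)  # taxonomy labels the answer refers to

@dataclass
class SurgicalFrameRecord:
    frame_id: str
    domain: str            # "laparoscopic" | "robot_assisted" | "microsurgical"
    masks: list[InstrumentMask] = field(default_factory=list)
    vqa: list[VQAAnnotation] = field(default_factory=list)

# Example: a microsurgical frame from an artificial vascular anastomosis clip.
record = SurgicalFrameRecord(
    frame_id="mavis_0001_000123",
    domain="microsurgical",
    masks=[InstrumentMask("micro_forceps", rle_mask="...")],
    vqa=[VQAAnnotation(
        question="Which instrument is holding the vessel wall?",
        answer="micro forceps",
        grounded_classes=["micro_forceps"],
    )],
)
```

Keeping masks and VQA entries on the same record, keyed to one taxonomy, is what would let a single evaluation cover both segmentation and conversational tasks.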
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of a unified multimodal benchmark for surgical AI evaluation
Integrating pixel-level segmentation with structured VQA annotations across surgical domains
Enabling comprehensive evaluation beyond traditional visual question answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal benchmark for surgical scene understanding
Integrates pixel-level segmentation and structured VQA annotations
A single model achieves cross-domain generalization on the benchmark (see the evaluation sketch below)
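As a rough illustration of how such cross-domain evaluation could be run, the following sketch scores one model per surgical domain, reusing the record layout sketched above. The `load_records` loader, `model.answer` interface, and exact-match metric are hypothetical stand-ins, not the paper's released evaluation code.

```python
from collections import defaultdict

# Hypothetical cross-domain evaluation loop: one model is scored on each
# surgical domain, including a held-out domain to probe zero-shot
# generalization. `model.answer` and `load_records` are assumed interfaces,
# and exact match is a toy stand-in for the paper's actual VQA metrics.

DOMAINS = ["laparoscopic", "robot_assisted", "microsurgical"]

def exact_match(pred: str, gold: str) -> float:
    """Toy metric: case-insensitive exact string match."""
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate(model, load_records, domains=DOMAINS):
    """Return per-domain VQA accuracy for a single model."""
    scores = defaultdict(list)
    for domain in domains:
        for record in load_records(domain):  # records as in the sketch above
            for qa in record.vqa:
                pred = model.answer(record.frame_id, qa.question)
                scores[domain].append(exact_match(pred, qa.answer))
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}

# Usage idea: train on two domains, then compare the evaluate() score on the
# held-out third domain against domain-specific baselines.
```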
Tae-Min Choi
Samsung Research
Tae Kyeong Jeong
Center for Humanoid Research, Korea Institute of Science and Technology
Garam Kim
Center for Humanoid Research, Korea Institute of Science and Technology
Jaemin Lee
Department of Plastic Surgery, College of Medicine, Korea University
Yeongyoon Koh
Department of Orthopedic Surgery, College of Medicine, Korea University
In Cheul Choi
Department of Orthopedic Surgery, College of Medicine, Korea University
Jae-Ho Chung
Department of Plastic Surgery, College of Medicine, Korea University
Jong Woong Park
Department of Orthopedic Surgery, College of Medicine, Korea University
Juyoun Park
Senior Researcher at Korea Institute of Science and Technology (KIST)
Machine Learning · Artificial Intelligence · Robotics