Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses key challenges in surgical AI (subjective decision-making, data scarcity, and dynamic operating environments) by systematically evaluating 11 large vision-language models (VLMs) across 17 surgical visual understanding tasks spanning laparoscopic, robotic, and open procedures. The authors introduce the first cross-procedural, multi-task VLM benchmark for surgery and propose an in-context learning (ICL)-based zero- and few-shot inference paradigm, achieving up to 3× performance gains and markedly improving adaptability to real-world clinical dynamics. Experimental results show that VLMs can outperform supervised models on static tasks (e.g., anatomical identification), demonstrating strong generalization; however, they remain limited in spatiotemporal reasoning. The core contributions are: (1) establishing the first dedicated surgical VLM evaluation framework, and (2) empirically validating the feasibility and potential of ICL-driven VLMs for practical deployment in surgical AI systems.

📝 Abstract
Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet, VLMs' practical utility in intervention-focused domains, especially surgery, where decision-making is subjective and clinical scenarios are variable, remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI, from anatomy recognition to skill assessment, using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, incorporating examples during testing, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs for surgical AI tasks without task-specific training
Assessing VLMs' adaptability in variable surgical decision-making scenarios
Identifying VLMs' limitations in spatial and temporal reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language Models for surgical AI
In-context learning boosts performance significantly
Generalizability across diverse surgical tasks
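The ICL paradigm highlighted above works by prepending a few labeled example frames to the query at inference time, so a general-purpose VLM can adapt to a surgical task without any fine-tuning (zero-shot when no examples are given). A minimal sketch of how such a few-shot multimodal prompt could be assembled; the function name, message schema, and file names are illustrative assumptions, not the paper's actual implementation:

```python
def build_icl_prompt(task, examples, query_image):
    """Assemble a few-shot multimodal message list for a VLM.

    `examples` is a list of (image, label) pairs used as in-context
    demonstrations; an empty list yields a zero-shot prompt.
    """
    messages = [{
        "role": "system",
        "content": f"You are assisting with surgical video analysis: {task}.",
    }]
    # Each demonstration contributes a user turn (image + question)
    # and an assistant turn (the ground-truth label).
    for image, label in examples:
        messages.append({"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is shown in this frame?"},
        ]})
        messages.append({"role": "assistant", "content": label})
    # The actual query frame comes last, answered by the model.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": query_image},
        {"type": "text", "text": "What is shown in this frame?"},
    ]})
    return messages

# Hypothetical usage: two demonstrations, then a query frame.
examples = [("frame_001.png", "gallbladder"), ("frame_014.png", "liver")]
prompt = build_icl_prompt("anatomy recognition", examples, "frame_200.png")
```

A zero-shot prompt is just the system turn plus the query (2 messages); each in-context example adds two more, which is what lets the same model be re-adapted to new tasks at test time.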
Anita Rau
Postdoc, Stanford University
Computer Vision · Machine Learning

Mark Endo
Stanford University

Josiah Aklilu
PhD student, Stanford University
Artificial Intelligence · Computer Vision

Jaewoo Heo
Stanford University

Khaled Saab
Google DeepMind

Alberto Paderno
Humanitas University

Jeffrey K. Jopling
Johns Hopkins University

F. C. Holsinger
Stanford University

S. Yeung-Levy
Stanford University