🤖 AI Summary
This study addresses key challenges in surgical AI, namely subjective decision-making, data scarcity, and dynamic operating environments, by systematically evaluating 11 large vision-language models (VLMs) across 17 surgical visual understanding tasks spanning laparoscopic, robotic, and open procedures. We introduce the first cross-procedural, multi-task VLM benchmark for surgery and evaluate a zero- and few-shot inference paradigm based on in-context learning (ICL), which yields up to 3× performance gains and markedly improves adaptability to real-world clinical dynamics. Experimental results show that VLMs can outperform supervised models on static tasks (e.g., anatomical identification), demonstrating strong generalization; however, they remain limited in spatiotemporal reasoning. Our core contributions are: (1) establishing the first dedicated surgical VLM evaluation framework, and (2) empirically validating the feasibility and potential of ICL-driven VLMs for practical deployment in surgical AI systems.
📝 Abstract
Large Vision-Language Models (VLMs) offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training. This flexibility holds particular promise across medicine, where expert-annotated data is scarce. Yet the practical utility of VLMs in intervention-focused domains, especially surgery, where decision-making is subjective and clinical scenarios are variable, remains uncertain. Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI, from anatomy recognition to skill assessment, using 13 datasets spanning laparoscopic, robotic, and open procedures. In our experiments, VLMs demonstrate promising generalizability, at times outperforming supervised models when deployed outside their training setting. In-context learning, in which examples are provided at test time, boosted performance up to three-fold, suggesting adaptability as a key strength. Still, tasks requiring spatial or temporal reasoning remained difficult. Beyond surgery, our findings offer insights into VLMs' potential for tackling complex and dynamic scenarios in clinical and broader real-world applications.
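To make the zero-shot vs. few-shot in-context learning distinction concrete, below is a minimal sketch of how a general-purpose VLM might be prompted for a surgical phase recognition task. This is an illustration under assumptions, not the paper's actual pipeline: the model choice (`gpt-4o`), the `classify_frame` helper, the file names, and the phase labels are all hypothetical, and the abstract does not specify the exact prompts or models used.

```python
import base64
from typing import Sequence

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Read a local surgical frame and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def classify_frame(query_frame: str,
                   examples: Sequence[tuple[str, str]] = ()) -> str:
    """Label a surgical frame with a VLM, optionally using in-context examples.

    `examples` is a sequence of (frame_path, label) pairs: an empty sequence
    gives zero-shot inference; a non-empty one gives few-shot ICL.
    """
    content = [{
        "type": "text",
        "text": ("Identify the surgical phase shown in the final image. "
                 "Answer with a single phase name."),
    }]
    # Interleave annotated example frames with their labels (the ICL context).
    for frame_path, label in examples:
        content.append({"type": "image_url",
                        "image_url": {"url": encode_image(frame_path)}})
        content.append({"type": "text", "text": f"Phase: {label}"})
    # The unlabeled query frame comes last.
    content.append({"type": "image_url",
                    "image_url": {"url": encode_image(query_frame)}})

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the study benchmarks 11 different VLMs
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Zero-shot: no examples in the prompt.
# classify_frame("query_frame.jpg")

# Few-shot ICL: prepend annotated frames (hypothetical files and labels).
# classify_frame("query_frame.jpg",
#                examples=[("ex1.jpg", "calot triangle dissection"),
#                          ("ex2.jpg", "gallbladder retraction")])
```

The only difference between the two regimes is whether labeled example frames are interleaved into the prompt before the query image; no model weights are updated, which is why the paper frames ICL as a lightweight adaptability mechanism.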