A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI models perform poorly on surgical image-analysis tasks such as neurosurgical instrument detection and fall short of clinical requirements. This study systematically evaluates state-of-the-art, billion-parameter vision-language models (representative of 2026-level advancements) on surgical instrument detection and runs scaling experiments that vary model size and training duration. The findings reveal that merely increasing model scale or computational resources yields limited performance gains; instead, non-scalable factors such as annotation quality and domain-specific adaptation play a decisive role. Even with cutting-edge architectures and extensive training resources, improvements in instrument detection accuracy remain marginal, and consistent performance bottlenecks persist across diverse model architectures.
📝 Abstract
Recent Artificial Intelligence (AI) models have matched or exceeded human experts on several benchmarks of biomedical task performance, but have lagged behind on surgical image-analysis benchmarks. Since surgery requires integrating disparate tasks -- including multimodal data integration, human interaction, and physical effects -- generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time leads only to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.
Problem

Research questions and friction points this paper is trying to address.

surgical AI
tool detection
Vision Language Models
scaling limitations
surgical image analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical AI
vision-language models
scaling laws
tool detection
Med-AGI
Kirill Skobelev
Center for Applied AI, Chicago Booth, Chicago, IL, USA
Eric Fithian
Center for Applied AI, Chicago Booth, Chicago, IL, USA
Yegor Baranovski
Center for Applied AI, Chicago Booth, Chicago, IL, USA
Jack Cook
MIT
artificial intelligence, systems for machine learning
Sandeep Angara
Surgical Data Science Collective, Washington D.C., USA
Shauna Otto
Surgical Data Science Collective, Washington D.C., USA
Zhuang-Fang Yi
Surgical Data Science Collective, Washington D.C., USA
John Zhu
Surgical Data Science Collective, Washington D.C., USA
Daniel A. Donoho
Surgical Data Science Collective, Washington D.C., USA; Children’s National Hospital, Washington D.C., USA
X. Y. Han
Center for Applied AI, Chicago Booth, Chicago, IL, USA; Operations Management & Tolan Center for Healthcare, Chicago Booth, Chicago, IL, USA
Neeraj Mainkar
Surgical Data Science Collective, Washington D.C., USA
Margaux Masson-Forsythe
Surgical Data Science Collective, Washington D.C., USA