SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current surgical AI systems lack the capacity for deep reasoning about intraoperative decision intent, risk assessment, and future actions, primarily due to the scarcity of high-quality surgical reasoning data. This work presents the first systematic effort to extract expert commentary from unstructured surgical instructional videos, resulting in SUREON, a large-scale video question-answering dataset covering twelve categories of reasoning questions, and introduces two specialized vision-language models: SureonVLM, adapted through supervised fine-tuning, and SureonVLM-R1, trained with Group Relative Policy Optimization. Built on a multi-agent automated annotation pipeline, both models exceed 84% accuracy on the SUREON benchmark, substantially outperforming general-purpose large models while also excelling in standard surgical perception tasks; SureonVLM-R1 additionally infers procedural intent explicitly. This study establishes the first surgical reasoning benchmark and dedicated architecture, advancing surgical AI toward higher-order cognition.
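
As a concrete illustration of the annotation pipeline mentioned above, here is a minimal sketch of a two-agent extract-and-verify loop over narrated transcript segments. The page does not disclose the pipeline's actual interfaces; every name below (SurgicalQA, ask_llm, harvest_qa) is a hypothetical stand-in, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class SurgicalQA:
    clip_id: str    # source video clip
    category: str   # one of the 12 reasoning categories, e.g. "decision rationale"
    question: str
    answer: str

def harvest_qa(segments, ask_llm):
    """Two-agent sketch: a generator agent drafts a QA pair from each
    narrated transcript segment; a verifier agent keeps it only if the
    pair is grounded in the narration. Category assignment (one of the
    12 types) would be a further agent step, elided here."""
    kept = []
    for clip_id, narration in segments:
        draft = ask_llm(
            "Write one surgical reasoning question and its answer as "
            f"'question ||| answer', grounded only in this narration:\n{narration}"
        )
        question, _, answer = draft.partition("|||")
        verdict = ask_llm(
            "Answer yes or no: is the QA pair fully supported by the "
            f"narration?\nNarration: {narration}\nQA: {draft}"
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(SurgicalQA(clip_id, "unlabeled",
                                   question.strip(), answer.strip()))
    return kept
```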

📝 Abstract
Surgeons don't just see -- they interpret. When an expert observes a surgical scene, they understand not only what instrument is being used, but why it was chosen, what risk it poses, and what comes next. Current surgical AI cannot answer such questions, largely because training data that explicitly encodes surgical reasoning is immensely difficult to annotate at scale. Yet surgical video lectures already contain exactly this -- explanations of intent, rationale, and anticipation, narrated by experts for the purpose of teaching. Though inherently noisy and unstructured, these narrations encode the reasoning that surgical AI currently lacks. We introduce SUREON, a large-scale video QA dataset that systematically harvests this training signal from surgical academic videos. SUREON defines 12 question categories covering safety assessment, decision rationale, and forecasting, and uses a multi-agent pipeline to extract and structure supervision at scale. Across 134.7k clips and 170 procedure types, SUREON yields 206.8k QA pairs and an expert-validated benchmark of 354 examples. To evaluate the extent to which this supervision translates to surgical reasoning ability, we introduce two models: SureonVLM, a vision-language model adapted through supervised fine-tuning, and SureonVLM-R1, a reasoning model trained with Group Relative Policy Optimization. Both models can answer complex questions about surgery, exceeding 84% accuracy on the SUREON benchmark and substantially outperforming larger general-domain models there as well as on standard surgical perception tasks. Qualitative analysis of SureonVLM-R1 reveals explicit reasoning behavior, such as inferring operative intent from visual context.
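
For readers unfamiliar with Group Relative Policy Optimization: instead of a learned value critic, GRPO samples a group of candidate answers per prompt and standardizes each answer's reward against its group to obtain the advantage, then applies a PPO-style clipped objective. Below is a minimal NumPy sketch of that core computation under our own simplifications (sequence-level log-probabilities, no KL-to-reference penalty) -- an illustration of the general algorithm, not code from the paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage: each sampled answer's reward,
    standardized against the mean/std of its own group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate with group-relative advantages.
    logp_new/logp_old are sequence-level log-probabilities of the
    same sampled answers under the current and behavior policies."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()

# Example: 4 sampled answers to one SUREON-style question, rewarded
# 1.0 if they match the annotated answer and 0.0 otherwise.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
loss = grpo_clipped_loss([-3.1, -5.2, -4.8, -2.9],
                         [-3.0, -5.0, -5.0, -3.0], adv)
print(adv, loss)
```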
Problem

Research questions and friction points this paper is trying to address.

surgical reasoning
vision-language model
surgical AI
reasoning benchmark
surgical video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

surgical reasoning
vision-language model
video QA dataset
multi-agent pipeline
Group Relative Policy Optimization