Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited explainable reasoning of existing surgical vision-language models and the failure of general-purpose reasoning models in complex surgical scenarios caused by insufficient domain knowledge. To bridge this gap, we propose a surgical vision-language foundation model built on a three-tier hierarchical reasoning architecture spanning perceptual grounding, relational understanding, and contextual reasoning. We design a four-stage training paradigm and curate the largest surgical chain-of-thought dataset to date, comprising 320,000 image-text pairs. Through supervised fine-tuning and group relative policy optimization, the model achieves a 64.9% Arena Score on SurgBench, outperforming Gemini 3.0 Pro and GPT-5.1, and delivers a 15.2-percentage-point improvement over the strongest baseline in multi-center clinical validation.

📝 Abstract
Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.
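The abstract credits group relative policy optimization (GRPO) for part of the training pipeline. Below is a minimal sketch of the group-relative advantage at GRPO's core, assuming the standard formulation in which each sampled response's reward is normalized against its group; the paper's actual reward design and hyperparameters are not specified here.

```python
# Group-relative advantage as used in GRPO-style training (illustrative
# sketch; function and argument names are assumptions, not the paper's API).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """For G responses sampled per prompt with rewards r_i, compute
    A_i = (r_i - mean(r)) / (std(r) + eps), so responses are scored
    relative to their own group rather than by a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four candidate reasoning chains scored for one surgical prompt.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages are centered within each group, they sum to (approximately) zero, which is what removes the need for a separate value network.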
Problem

Research questions and friction points this paper is trying to address.

surgical scene understanding
interpretable reasoning
vision-language models
compositional surgical tasks
clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical reasoning
surgical vision-language model
chain-of-thought dataset
multi-center validation
interpretable decision support
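The contributions above center on a three-tier reasoning hierarchy and a chain-of-thought dataset. One plausible record layout for such a dataset, with one text field per tier, might look like the following; all field names and the example content are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical record layout for a three-tier surgical chain-of-thought
# sample, mirroring the hierarchy named in the abstract: perceptual
# grounding -> relational understanding -> contextual reasoning.
from dataclasses import dataclass

@dataclass
class SurgicalCoTSample:
    image_id: str
    question: str
    perceptual_grounding: str      # tier 1: what is visible (instruments, anatomy)
    relational_understanding: str  # tier 2: how entities interact (e.g. triplets)
    contextual_reasoning: str      # tier 3: clinical judgment (phase, CVS)
    answer: str

sample = SurgicalCoTSample(
    image_id="frame_000123",
    question="Is the critical view of safety achieved?",
    perceptual_grounding="Grasper retracts the gallbladder; hepatocystic triangle partly dissected.",
    relational_understanding="Two structures enter the gallbladder; cystic plate not yet exposed.",
    contextual_reasoning="Cystic plate exposure is incomplete, so CVS criteria are not fully met.",
    answer="No",
)
```

Structuring each sample this way lets a model be supervised tier by tier, which is consistent with (though not confirmed to be) the staged training the summary describes.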
🔎 Related Researchers
- Jian Jiang — Shanghai Jiaotong University — Vision-Language Model
- Chenxi Lin — yitutech.com
- Yiming Gu — Google — Artificial Intelligence, Machine Learning, Transportation Engineering
- Zengyi Qin — Massachusetts Institute of Technology — Multi-modal LLMs and Agents
- Zhitao Zeng — National University of Singapore — Vision-Language Models
- Kun Yuan — University of Strasbourg & Technical University of Munich — surgical data science, multi-modal learning
- Yonghao Long — Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Xiang Xia — Department of Gastrointestinal Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Cheng Yuan — Associate Professor, School of Mathematics and Statistics, Central China Normal University — Computational Physics, Deep Learning
- Yuqi Wang — Shanghai Jiao Tong University
- Zijie Yue — College of Electronic and Information Engineering, Tongji University, Shanghai, China
- Kunyi Yang — Global College, Shanghai Jiao Tong University, Shanghai, China
- Yuting Zhang — HKUST(GZ) — rPPG, Computer Vision
- Zhu Zhuo — National University of Singapore — Surgical Data Science, Multimodal Large Language Model
- Dian Qin — ChengDu Withai Innovations Technology Company; Zhejiang University — Computer Vision, Medical Imaging, Knowledge Distillation
- Xin Wang — Division of Pancreatic Surgery, Department of General Surgery, West China Hospital of Sichuan University, Chengdu, China
- NG Chi Fai — Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
- Brian Anthony — Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Daguang Xu — Senior Research Manager at NVIDIA — Deep Learning, Machine Learning, Medical Image Analysis, Compressive Sensing, Sparse coding
- Guy Rosman — Toyota Research Institute; Massachusetts General Hospital; Duke Surgery — Computer vision and robotic perception, Bayesian inference, trajectory prediction
- Ozanan Meireles — Massachusetts General Hospital, Massachusetts, US
- Zizhen Zhang — Department of Gastrointestinal Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Nicolas Padoy — Professor of Computer Science, University of Strasbourg — Surgical Data Science, Medical Image Analysis, Computer Vision, Machine Learning
- Hesheng Wang — School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
- Qi Dou — Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China