Surg-R1: A Hierarchical Reasoning Foundation Model for Scalable and Interpretable Surgical Decision Support with Multi-Center Clinical Validation

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited explainable reasoning of existing surgical vision-language models and the failure of general-purpose reasoning models in complex surgical scenarios caused by insufficient domain knowledge. To bridge this gap, we propose a surgical vision-language foundation model built on a three-tier hierarchical reasoning architecture spanning perceptual grounding, relational understanding, and contextual reasoning. We design a four-stage training paradigm and curate the largest surgical chain-of-thought dataset to date, comprising 320,000 image-text pairs. Through supervised fine-tuning and group relative policy optimization, the model achieves a 64.9% Arena Score on SurgBench, outperforming Gemini 3.0 Pro and GPT-5.1, and delivers a 15.2-percentage-point improvement over the strongest baseline in multi-center clinical validation.

📝 Abstract
Surgical scene understanding demands not only accurate predictions but also interpretable reasoning that surgeons can verify against clinical expertise. However, existing surgical vision-language models generate predictions without reasoning chains, and general-purpose reasoning models fail on compositional surgical tasks without domain-specific knowledge. We present Surg-R1, a surgical Vision-Language Model that addresses this gap through hierarchical reasoning trained via a four-stage pipeline. Our approach introduces three key contributions: (1) a three-level reasoning hierarchy decomposing surgical interpretation into perceptual grounding, relational understanding, and contextual reasoning; (2) the largest surgical chain-of-thought dataset with 320,000 reasoning pairs; and (3) a four-stage training pipeline progressing from supervised fine-tuning to group relative policy optimization and iterative self-improvement. Evaluation on SurgBench, comprising six public benchmarks and six multi-center external validation datasets from five institutions, demonstrates that Surg-R1 achieves the highest Arena Score (64.9%) on public benchmarks versus Gemini 3.0 Pro (46.1%) and GPT-5.1 (37.9%), outperforming both proprietary reasoning models and specialized surgical VLMs on the majority of tasks spanning instrument localization, triplet recognition, phase recognition, action recognition, and critical view of safety assessment, with a 15.2 percentage point improvement over the strongest surgical baseline on external validation.
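The abstract credits group relative policy optimization (GRPO) for part of the training pipeline. Below is a minimal sketch of the group-relative advantage at GRPO's core, assuming the standard formulation in which each sampled response's reward is normalized against its group; the paper's actual reward design and hyperparameters are not specified here.

```python
# Group-relative advantage as used in GRPO-style training (illustrative
# sketch; function and argument names are assumptions, not the paper's API).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """For G responses sampled per prompt with rewards r_i, compute
    A_i = (r_i - mean(r)) / (std(r) + eps), so responses are scored
    relative to their own group rather than by a learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four candidate reasoning chains scored for one surgical prompt.
advs = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
```

Because advantages are centered within each group, they sum to (approximately) zero, which is what removes the need for a separate value network.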
Problem

Research questions and friction points this paper is trying to address.

surgical scene understanding
interpretable reasoning
vision-language models
compositional surgical tasks
clinical decision support
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical reasoning
surgical vision-language model
chain-of-thought dataset
multi-center validation
interpretable decision support
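The contributions above center on a three-tier reasoning hierarchy and a chain-of-thought dataset. One plausible record layout for such a dataset, with one text field per tier, might look like the following; all field names and the example content are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical record layout for a three-tier surgical chain-of-thought
# sample, mirroring the hierarchy named in the abstract: perceptual
# grounding -> relational understanding -> contextual reasoning.
from dataclasses import dataclass

@dataclass
class SurgicalCoTSample:
    image_id: str
    question: str
    perceptual_grounding: str      # tier 1: what is visible (instruments, anatomy)
    relational_understanding: str  # tier 2: how entities interact (e.g. triplets)
    contextual_reasoning: str      # tier 3: clinical judgment (phase, CVS)
    answer: str

sample = SurgicalCoTSample(
    image_id="frame_000123",
    question="Is the critical view of safety achieved?",
    perceptual_grounding="Grasper retracts the gallbladder; hepatocystic triangle partly dissected.",
    relational_understanding="Two structures enter the gallbladder; cystic plate not yet exposed.",
    contextual_reasoning="Cystic plate exposure is incomplete, so CVS criteria are not fully met.",
    answer="No",
)
```

Structuring each sample this way lets a model be supervised tier by tier, which is consistent with (though not confirmed to be) the staged training the summary describes.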
🔎 Related Researchers
- Jian Jiang — Shanghai Jiaotong University — Vision-Language Model
- Chenxi Lin — yitutech.com
- Yiming Gu — Google — Artificial Intelligence, Machine Learning, Transportation Engineering
- Zengyi Qin — Massachusetts Institute of Technology — Multi-modal LLMs and Agents
- Zhitao Zeng — National University of Singapore — Vision-Language Models
- Kun Yuan — University of Strasbourg & Technical University of Munich — surgical data science, multi-modal learning
- Yonghao Long — Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China
- Xiang Xia — Department of Gastrointestinal Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Cheng Yuan — Associate Professor, School of Mathematics and Statistics, Central China Normal University — Computational Physics, Deep Learning
- Yuqi Wang — Shanghai Jiao Tong University
- Zijie Yue — College of Electronic and Information Engineering, Tongji University, Shanghai, China
- Kunyi Yang — Global College, Shanghai Jiao Tong University, Shanghai, China
- Yuting Zhang — HKUST(GZ) — rPPG, Computer Vision
- Zhu Zhuo — National University of Singapore — Surgical Data Science, Multimodal Large Language Model
- Dian Qin — ChengDu Withai Innovations Technology Company; Zhejiang University — Computer Vision, Medical Imaging, Knowledge Distillation
- Xin Wang — Division of Pancreatic Surgery, Department of General Surgery, West China Hospital of Sichuan University, Chengdu, China
- NG Chi Fai — Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
- Brian Anthony — Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Daguang Xu — Senior Research Manager at NVIDIA — Deep Learning, Machine Learning, Medical Image Analysis, Compressive Sensing, Sparse coding
- Guy Rosman — Toyota Research Institute; Massachusetts General Hospital; Duke Surgery — Computer vision and robotic perception, Bayesian inference, trajectory prediction
- Ozanan Meireles — Massachusetts General Hospital, Massachusetts, US
- Zizhen Zhang — Department of Gastrointestinal Surgery, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Nicolas Padoy — Professor of Computer Science, University of Strasbourg — Surgical Data Science, Medical Image Analysis, Computer Vision, Machine Learning
- Hesheng Wang — School of Automation and Intelligent Sensing, Shanghai Jiao Tong University
- Qi Dou — Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China