SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the clinical reliability bottlenecks of vision-language models (VLMs) in surgical intelligence, including frequent hallucinations, domain knowledge gaps, and insufficient modeling of task interdependencies, this paper proposes SurgRAW, a chain-of-thought (CoT)-guided multi-agent collaborative reasoning framework. A hierarchical multi-agent system, coupled with a panel-discussion mechanism, enforces logical consistency across tasks, while domain-customized CoT prompting combined with retrieval-augmented generation (RAG) bridges surgical knowledge deficits and suppresses hallucinations. Evaluated on the newly constructed SurgCoTBench benchmark, which spans 12 real-world robotic procedures, the framework improves accuracy by 29.32% over baseline VLMs, reaching state-of-the-art performance while markedly reducing hallucinations, and its reasoning process offers greater interpretability and clinical credibility.

📝 Abstract
Integration of Vision-Language Models (VLMs) in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies within surgical scenes, undermining clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task-awareness required for precise surgical scene interpretation. Although Chain-of-Thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By employing specialized CoT prompts across five tasks (instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment), SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-Augmented Generation (RAG) is also integrated to access external medical knowledge, bridging domain gaps and improving response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, and a panel discussion mechanism promotes logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. Through comprehensive experiments, we demonstrate the effectiveness of the proposed SurgRAW with a 29.32% accuracy improvement over baseline VLMs on 12 robotic procedures, achieving state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.
Problem

Research questions and friction points this paper is trying to address.

Addresses hallucinations and domain gaps in Vision-Language Models for surgery.
Enhances task-awareness and domain expertise in surgical scene interpretation.
Improves reliability and accuracy in robotic-assisted surgical tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent framework with Chain-of-Thought reasoning
Integration of Retrieval-Augmented Generation for domain knowledge
Hierarchical agentic system for task interdependency understanding
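The contributions above can be illustrated with a minimal sketch: a hierarchical orchestrator routes each query to a task-specific agent that carries a dedicated CoT prompt, and a simple majority vote stands in for the paper's panel-discussion mechanism. All names here (`COT_PROMPTS`, `Agent`, `Orchestrator`) are illustrative assumptions, not the paper's actual API, and the majority vote is a simplification of the described panel discussion.

```python
# Illustrative sketch only; names and structure are assumptions, not SurgRAW's code.

# One domain-aware CoT prompt per task, mirroring the five tasks in the abstract.
COT_PROMPTS = {
    "instrument_recognition": "Step 1: list visible tools. Step 2: match each to a known instrument. Step 3: answer.",
    "action_recognition": "Step 1: describe the tool-tissue interaction. Step 2: map it to an action label. Step 3: answer.",
    "action_prediction": "Step 1: summarize the current action. Step 2: infer the likely next step. Step 3: answer.",
    "patient_data_extraction": "Step 1: locate on-screen patient data. Step 2: extract the requested field. Step 3: answer.",
    "outcome_assessment": "Step 1: assess tissue state. Step 2: relate it to expected outcomes. Step 3: answer.",
}

class Agent:
    """A task-specific VLM agent with an embedded CoT prompt."""
    def __init__(self, task, vlm):
        self.task = task
        self.vlm = vlm  # any callable: (prompt, frame) -> answer string

    def run(self, frame):
        return self.vlm(COT_PROMPTS[self.task], frame)

class Orchestrator:
    """Routes a query to the right agent, then resolves answers by majority vote
    (a stand-in for the paper's panel-discussion mechanism)."""
    def __init__(self, agents):
        self.agents = agents  # task name -> Agent

    def answer(self, task, frame, panel_size=3):
        votes = [self.agents[task].run(frame) for _ in range(panel_size)]
        return max(set(votes), key=votes.count)  # most common answer wins
```

A usage example with a stub VLM would look like `Orchestrator({"instrument_recognition": Agent("instrument_recognition", my_vlm)}).answer("instrument_recognition", frame)`; in the actual system, the agents would also consult a RAG retriever over external medical knowledge before answering.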