More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study presents the first systematic evaluation of SAM 3’s multimodal zero-shot segmentation and 3D perception capabilities in robot-assisted surgery. Addressing key challenges—including surgical instrument segmentation, dynamic video tracking, language-guided localization, and 2D-to-3D anatomical reconstruction—we benchmark SAM 3 across the EndoVis, SCARED, StereoMIS, and EndoNeRF datasets using unified point, box, and text prompts to assess segmentation accuracy, monocular depth estimation, and instrument 3D reconstruction quality. Results demonstrate that SAM 3 significantly outperforms SAM and SAM 2 on EndoVis 2017/2018 and achieves high-fidelity instrument reconstruction with robust depth prediction; language prompts show cross-modal promise but remain suboptimal in the surgical domain. Tracking stability and structural consistency also degrade under highly dynamic conditions with severe occlusions. This work extends the applicability boundary of foundation vision models in medical robotics and establishes a reproducible, multi-task benchmark for surgical AI.
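The segmentation accuracy mentioned above is typically scored by comparing predicted and ground-truth masks. A minimal sketch of the two standard overlap metrics (IoU and Dice) for binary NumPy masks; the function names are illustrative, not from the paper:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks count as a perfect match.
    return float(inter / union) if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient (F1 over pixels) between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return float(2 * inter / total) if total else 1.0
```

Benchmarks such as EndoVis commonly report the mean of these scores over all frames and instrument classes.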

📝 Abstract
The recent Segment Anything Model (SAM) 3 has introduced significant advancements over its predecessor, SAM 2, particularly with the integration of language-based segmentation and enhanced 3D perception capabilities. SAM 3 supports zero-shot segmentation across a wide range of prompts, including point, bounding box, and language-based prompts, allowing for more flexible and intuitive interactions with the model. In this empirical evaluation, we assess the performance of SAM 3 in robot-assisted surgery, benchmarking its zero-shot segmentation with point and bounding box prompts and exploring its effectiveness in dynamic video tracking, alongside its newly introduced language prompt segmentation. While language prompts show potential, their performance in the surgical domain is currently suboptimal, highlighting the need for further domain-specific training. Additionally, we investigate SAM 3's 3D reconstruction abilities, demonstrating its capacity to process surgical scene data and reconstruct 3D anatomical structures from 2D images. Through comprehensive testing on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 3 shows clear improvements over SAM and SAM 2 in both image and video segmentation under spatial prompts, while zero-shot evaluations on SCARED, StereoMIS, and EndoNeRF indicate strong monocular depth estimation and realistic 3D instrument reconstruction, yet also reveal remaining limitations in complex, highly dynamic surgical scenes.
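The monocular depth evaluation described above is conventionally reported with error metrics such as absolute relative error and RMSE against ground-truth depth. A minimal sketch of these two metrics, assuming dense NumPy depth maps where zero marks invalid ground truth; the helper name is illustrative, not from the paper:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray = None):
    """Absolute relative error and RMSE over valid depth pixels."""
    if mask is None:
        mask = gt > 0  # ignore pixels with no ground-truth depth
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)   # error relative to true depth
    rmse = np.sqrt(np.mean((p - g) ** 2))  # root-mean-square error
    return float(abs_rel), float(rmse)
```

For stereo datasets such as SCARED, the ground-truth depth comes from structured light or stereo reconstruction, and the mask excludes regions where that reconstruction failed.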
Problem

Research questions and friction points this paper is trying to address.

Evaluates SAM 3's zero-shot segmentation in robotic surgery
Investigates SAM 3's 3D reconstruction from 2D surgical images
Benchmarks SAM 3's performance in dynamic surgical video tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot segmentation with diverse prompts
3D anatomical reconstruction from 2D images
Enhanced video tracking in surgical scenes