🤖 AI Summary
To address the performance degradation of interactive video object segmentation (iVOS) in surgical videos, caused by domain shift and insufficient long-term temporal tracking, this work introduces SA-SV, the first large-scale surgical iVOS benchmark (61k frames, 1.6k instance-level masklets across eight procedure types). Building on SAM2, the proposed SAM2S model integrates DiveMem, a trainable diverse memory mechanism for robust long-term tracking, together with temporal semantic learning for instrument understanding and ambiguity-resilient learning that mitigates annotation inconsistencies across multi-source datasets. Fine-tuned on SA-SV, SAM2 gains 12.99 points in average J&F; SAM2S further reaches 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while sustaining real-time inference at 68 FPS. The approach substantially improves generalization and practicality in complex, category-agnostic surgical scenarios.
📝 Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing **SAM2** for **S**urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
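The J&F scores reported above are the standard VOS metric combining region similarity (J, the Jaccard index between predicted and ground-truth masks) and boundary accuracy (F), averaged per object and frame. As a point of reference, here is a minimal sketch of the J component for binary NumPy masks; the function name `jaccard` is illustrative and not taken from the paper's released code:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

# Example: two 2x2 squares shifted by one column in a 3x3 frame
# share 2 pixels (intersection) out of 6 covered (union), so J = 1/3.
pred = np.zeros((3, 3)); pred[:2, :2] = 1
gt = np.zeros((3, 3)); gt[:2, 1:] = 1
print(jaccard(pred, gt))
```

The benchmark's F term additionally measures boundary agreement (with a small distance tolerance in the official evaluation); the reported J&F is the mean of the two.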