🤖 AI Summary
To address the performance degradation of interactive video object segmentation (iVOS) in surgical videos, caused by domain shift and insufficient long-term temporal tracking, this work introduces SA-SV, the first large-scale surgical iVOS benchmark (61k frames, 1.6k instance-level masklets across eight procedure types). Building on SAM2, the proposed SAM2S model integrates DiveMem, a trainable diverse memory mechanism for robust long-term tracking, together with temporal semantic learning for instrument understanding and ambiguity-resilient learning that mitigates annotation inconsistencies across multi-source datasets. Fine-tuned on SA-SV, SAM2 gains 12.99 points in average J&F; SAM2S further reaches 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while sustaining real-time inference at 68 FPS. The approach substantially improves generalization and practicality in complex, category-agnostic surgical scenarios.
📝 Abstract
Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing **SAM2** for **S**urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
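The J&F scores reported above are the standard VOS metric combining region similarity (J, the Jaccard index between predicted and ground-truth masks) and boundary accuracy (F), averaged per object and frame. As a point of reference, here is a minimal sketch of the J component for binary NumPy masks; the function name `jaccard` is illustrative and not taken from the paper's released code:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)

# Example: two 2x2 squares shifted by one column in a 3x3 frame
# share 2 pixels (intersection) out of 6 covered (union), so J = 1/3.
pred = np.zeros((3, 3)); pred[:2, :2] = 1
gt = np.zeros((3, 3)); gt[:2, 1:] = 1
print(jaccard(pred, gt))
```

The benchmark's F term additionally measures boundary agreement (with a small distance tolerance in the official evaluation); the reported J&F is the mean of the two.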