🤖 AI Summary
SAM2 has limited robustness and efficiency in video object segmentation and tracking (VOST) for surgical videos, owing to complex object motion and redundant historical features. This paper proposes TSMS-SAM2 to address both problems. Methodologically: (1) a multi-scale temporal sampling strategy is introduced to better model rapid motion dynamics; (2) a memory splitting and pruning mechanism is designed to explicitly organize and filter historical frame features, mitigating memory redundancy. The framework preserves SAM2's prompt-driven paradigm while improving segmentation accuracy. Evaluated on EndoVis2017 and EndoVis2018, TSMS-SAM2 achieves mean Dice scores of 95.24% and 86.73%, respectively, surpassing both existing SAM-based and dedicated VOST methods. These results support its effectiveness in dynamic minimally invasive surgical scenarios.
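For readers unfamiliar with the metric, the Dice score measures the overlap between a predicted mask and a ground-truth mask as 2|A ∩ B| / (|A| + |B|). The snippet below is a minimal sketch of that formula on flattened binary masks. The function names, the convention of returning 1.0 when both masks are empty, and the plain averaging over mask pairs are assumptions made for illustration; the summary does not say how the paper aggregates scores across frames, classes, or videos.

```python
def dice_score(pred, target):
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Unweighted average Dice over an iterable of (pred, target) pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they do not overlap; the percentages quoted above are this value multiplied by 100.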
📝 Abstract
Promptable video object segmentation and tracking (VOST) has seen significant advances with the emergence of foundation models like Segment Anything Model 2 (SAM2); however, their application in surgical video analysis remains challenging due to complex motion dynamics and memory redundancy that impedes effective learning. In this work, we propose TSMS-SAM2, a novel framework that enhances promptable VOST in surgical videos by addressing the challenges of rapid object motion and memory redundancy in SAM2. TSMS-SAM2 introduces two key strategies: multi-temporal-scale video sampling augmentation to improve robustness against motion variability, and a memory splitting and pruning mechanism that organizes and filters past frame features for more efficient and accurate segmentation. Evaluated on the EndoVis2017 and EndoVis2018 datasets, TSMS-SAM2 achieved the highest mean Dice scores of 95.24 and 86.73, respectively, outperforming prior SAM-based and task-specific methods. Extensive ablation studies confirm the effectiveness of multi-temporal-scale augmentation and memory splitting, highlighting the framework's potential for robust, efficient segmentation in complex surgical scenarios. Our source code will be available at https://github.com/apple1986/TSMS-SAM2.
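One plausible reading of multi-temporal-scale video sampling is to draw training clips from the same video at several frame strides, so that the model sees both slow and fast apparent motion. The sketch below illustrates that idea only. It is not taken from the TSMS-SAM2 code: the function names, the clip length, the stride set, and the random choice of start frame are all assumptions.

```python
import random


def sample_clip(num_frames, clip_len, stride, rng):
    """Pick clip_len frame indices spaced `stride` apart.

    Returns None when the video is too short for this stride.
    """
    span = (clip_len - 1) * stride + 1  # frames covered by the clip
    if span > num_frames:
        return None
    start = rng.randrange(num_frames - span + 1)
    return [start + i * stride for i in range(clip_len)]


def multi_scale_clips(num_frames, clip_len=8, strides=(1, 2, 4), seed=0):
    """Sample one clip per stride (hypothetical temporal scales)."""
    rng = random.Random(seed)
    clips = {}
    for stride in strides:
        clip = sample_clip(num_frames, clip_len, stride, rng)
        if clip is not None:
            clips[stride] = clip
    return clips
```

Calling `multi_scale_clips(100)` returns a dictionary with one list of eight frame indices for each of the strides 1, 2, and 4. A larger stride skips more frames between samples, which mimics faster instrument motion at a fixed clip length.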