MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
SAM2 faces two key limitations in video object segmentation: (1) its fixed six-frame memory window struggles with long-term target disappearance, and (2) memory built from fixed past frames is vulnerable to occlusion and erroneous segmentations, degrading tracking robustness over time. To address these, MoSAM introduces Motion-Guided Prompting (MGP) and Spatial-Temporal Memory Selection (ST-MS). MGP explicitly models object motion by fusing sparse and dense motion representations and injecting them into SAM2 as motion-guided prompts, steering the model's focus toward the direction of motion; ST-MS dynamically selects reliable memory at both the pixel and frame levels, discarding likely-inaccurate mask predictions so that more trustworthy features guide segmentation. Evaluated on multiple video object and video instance segmentation benchmarks, MoSAM achieves state-of-the-art results, notably improving segmentation accuracy and trajectory continuity under occlusion and target reappearance.
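To make the MGP idea concrete, here is a minimal sketch of motion-guided prompt generation. This is not the paper's implementation; the fusion rule, the function names, and the use of a single point prompt are all illustrative assumptions. The sparse cue is the centroid displacement between the last two masks, the dense cue is the mean optical flow inside the last mask, and the two are averaged to extrapolate where the object should be next.

```python
import numpy as np

def mask_centroid(mask: np.ndarray) -> np.ndarray:
    """Centroid (y, x) of a binary mask; assumes the mask is non-empty."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def motion_guided_point_prompt(prev_masks, flow):
    """Extrapolate a point prompt for the next frame (illustrative sketch).

    Sparse cue: centroid displacement between the last two masks.
    Dense cue: mean optical flow over the last mask's foreground pixels.
    flow is an (H, W, 2) array storing (dx, dy) per pixel.
    """
    c_prev = mask_centroid(prev_masks[-2])
    c_last = mask_centroid(prev_masks[-1])
    sparse_motion = c_last - c_prev                  # (dy, dx) from centroids
    fg = prev_masks[-1].astype(bool)
    dense_motion = flow[fg].mean(axis=0)[::-1]       # (dx, dy) -> (dy, dx)
    motion = 0.5 * (sparse_motion + dense_motion)    # fuse sparse + dense cues
    return c_last + motion                           # extrapolated (y, x) prompt
```

In the actual model the fused motion cue is injected as a set of learned prompts rather than a raw coordinate; this sketch only shows how sparse and dense motion can be combined into one prediction.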

📝 Abstract
The recent Segment Anything Model 2 (SAM2) has demonstrated exceptional capabilities in interactive object segmentation for both images and videos. However, as a foundation model for interactive segmentation, SAM2 performs segmentation directly from mask memory of the past six frames, leading to two significant challenges. Firstly, during video inference, objects may be lost when they disappear, since SAM2 relies solely on memory without accounting for object motion, which limits its long-range tracking capabilities. Secondly, its memory is constructed from fixed past frames, making it susceptible to object disappearance or occlusion, because potentially inaccurate segmentation results may enter the memory. To address these problems, we present MoSAM, which incorporates two key strategies to integrate object motion cues into the model and to establish more reliable feature memory. Firstly, we propose Motion-Guided Prompting (MGP), which represents object motion in both sparse and dense manners and injects it into SAM2 through a set of motion-guided prompts. MGP enables the model to shift its focus toward the direction of motion, thereby enhancing its object tracking capabilities. Furthermore, acknowledging that past segmentation results may be inaccurate, we devise a Spatial-Temporal Memory Selection (ST-MS) mechanism that dynamically identifies frames likely to contain accurate segmentation at both the pixel and frame levels. By eliminating potentially inaccurate mask predictions from memory, we can leverage more reliable memory features to exploit similar regions for improving segmentation results. Extensive experiments on various benchmarks for video object segmentation and video instance segmentation demonstrate that MoSAM achieves state-of-the-art results compared with competing methods.
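The ST-MS idea, selecting memory at both the frame and pixel levels, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's method: the confidence measure (distance of the predicted probability from 0.5), the top-k frame rule, and the threshold value are all hypothetical choices.

```python
import numpy as np

def select_memory(frames, probs, k=6, pixel_thresh=0.7):
    """Keep the k most reliable memory frames and their confident pixels.

    frames: list of per-frame feature maps.
    probs:  list of per-pixel mask probabilities in [0, 1], one per frame.
    Per-pixel confidence is |p - 0.5| * 2, so predictions near 0 or 1 score
    high and uncertain predictions near 0.5 score low; frame quality is the
    mean pixel confidence (both are assumed proxies, not the paper's scores).
    """
    conf = [np.abs(p - 0.5) * 2 for p in probs]      # per-pixel confidence
    quality = np.array([c.mean() for c in conf])     # frame-level quality
    keep = np.argsort(quality)[::-1][:k]             # indices of top-k frames
    memory = []
    for i in sorted(keep):                           # preserve temporal order
        valid = conf[i] >= pixel_thresh              # drop uncertain pixels
        memory.append((frames[i], valid))
    return memory
```

The key contrast with SAM2's fixed window is that membership in memory is earned by estimated reliability rather than by recency alone.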
Problem

Research questions and friction points this paper is trying to address.

SAM2 lacks motion tracking in video segmentation
Fixed memory frames cause unreliable segmentation results
How to integrate motion cues and dynamic memory selection into SAM2
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-Guided Prompting injects motion cues into SAM2
Spatial-Temporal Memory Selection dynamically filters reliable frames
Combines sparse and dense motion representations for tracking