🤖 AI Summary
To address challenges in Large-Scale Video Object Segmentation (LSVOS)—including object re-appearance, small-scale instances, severe occlusion, and crowded scenes—this paper proposes an efficient segmentation framework built upon SAM2. The method introduces two key innovations: (1) a long-term memory module to enhance cross-frame object re-identification robustness, and (2) a SAM2Long post-processing strategy that explicitly models long-range temporal dependencies to mitigate error accumulation. By integrating these components, the framework achieves significantly improved segmentation stability without compromising inference efficiency. Evaluated on the MOSE test set, it attains a J&F score of 0.8427, ranking third in the ICCV 2025 LSVOS Challenge. This result validates the effectiveness of synergistically combining long-term memory with explicit long-horizon temporal modeling for robust video object segmentation.
📝 Abstract
Large-scale Video Object Segmentation (LSVOS) addresses the challenge of accurately tracking and segmenting objects in long video sequences, where difficulties stem from object reappearance, small-scale targets, heavy occlusions, and crowded scenes. Existing approaches predominantly adopt SAM2-based frameworks with various memory mechanisms for complex video mask generation. In this report, we propose Segment Anything with Memory Strengthened Object Navigation (SAMSON), the 3rd-place solution in the MOSE track of ICCV 2025, which integrates the strengths of state-of-the-art VOS models into an effective paradigm. To handle visually similar instances and long-term object disappearance in MOSE, we incorporate a long-term memory module for reliable object re-identification. Additionally, we adopt SAM2Long as a post-processing strategy to reduce error accumulation and enhance segmentation stability in long video sequences. Our method achieved a final performance of 0.8427 in terms of J&F on the test-set leaderboard.
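The abstract describes a long-term memory module used to re-identify objects after long disappearances. The paper's actual module is not specified here; the following is a minimal, hypothetical sketch of the general idea — keeping a bank of appearance embeddings per object and matching a newly detected instance against it by cosine similarity. The class name `LongTermMemory`, the capacity, and the threshold are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class LongTermMemory:
    """Hypothetical long-term memory bank for object re-identification.

    Stores a bounded history of appearance embeddings per object id and
    matches a query embedding against all stored entries.
    """

    def __init__(self, capacity=64, match_threshold=0.7):
        self.capacity = capacity            # max stored embeddings per object
        self.match_threshold = match_threshold
        self.bank = {}                      # object_id -> list of embeddings

    def update(self, object_id, embedding):
        """Append an embedding for an object, evicting the oldest when full."""
        entries = self.bank.setdefault(object_id, [])
        entries.append(np.asarray(embedding, dtype=np.float64))
        if len(entries) > self.capacity:
            entries.pop(0)

    def reidentify(self, query_embedding):
        """Return the best-matching stored object id, or None if no stored
        embedding is similar enough (i.e. the instance looks new)."""
        best_id, best_sim = None, self.match_threshold
        for object_id, entries in self.bank.items():
            sim = max(cosine_similarity(query_embedding, e) for e in entries)
            if sim > best_sim:
                best_id, best_sim = object_id, sim
        return best_id
```

A reappearing object whose embedding stays close to its stored history is mapped back to its original id, while a dissimilar instance yields `None` and would be treated as a new object; this is the behavior the abstract attributes to the long-term memory module, reduced to its simplest form.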