🤖 AI Summary
To address weak feature representation and unstable temporal modeling in video object segmentation (VOS), this paper proposes SCOPE, a VOS framework that replaces Cutie's original encoder with SAM2's ViT encoder to strengthen spatial-semantic feature expressiveness, introduces a flow-guided, query-based motion prediction module to explicitly model inter-frame motion and improve temporal consistency, and adopts a multi-model ensemble strategy combining Cutie, SAM2, and the proposed variant. By jointly strengthening discriminative feature encoding and explicit motion modeling, SCOPE balances segmentation accuracy against temporal stability. Evaluated on the MOSEv2 track of the 7th LSVOS Challenge, SCOPE ranks third, empirically validating the role of robust feature encoding and explicit motion modeling in VOS robustness. Its design principles, leveraging foundation-model features, incorporating geometric priors via optical flow, and combining model ensembling with an end-to-end trainable architecture, offer generalizable insights for future VOS systems.
📝 Abstract
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. These results demonstrate the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
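The abstract does not specify how the three models' predictions are fused. A minimal sketch of one common fusion rule — per-pixel weighted averaging of foreground probabilities followed by thresholding — is shown below; the function name, uniform default weights, and 0.5 threshold are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def ensemble_masks(prob_maps, weights=None):
    """Fuse per-pixel foreground probabilities from several VOS models
    (e.g., Cutie, SAM2, and a modified variant) into one binary mask.

    prob_maps: list of HxW float arrays in [0, 1], one per model.
    weights:   optional per-model weights; uniform if None.
    NOTE: hypothetical fusion rule for illustration only.
    """
    stack = np.stack(prob_maps, axis=0)           # (M, H, W)
    if weights is None:
        weights = np.full(len(prob_maps), 1.0 / len(prob_maps))
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()         # normalize to sum to 1
    fused = np.tensordot(weights, stack, axes=1)  # weighted mean, (H, W)
    return fused > 0.5                            # threshold to binary mask

# Toy example: three 2x2 probability maps from three hypothetical models.
a = np.array([[0.9, 0.2], [0.8, 0.1]])
b = np.array([[0.7, 0.4], [0.6, 0.3]])
c = np.array([[0.8, 0.1], [0.9, 0.2]])
mask = ensemble_masks([a, b, c])  # left column foreground, right background
```

In practice, per-model weights could be tuned on a validation split, and fusion could also be done per object or per frame rather than globally.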