Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

📅 2025-09-19
🤖 AI Summary
To address weak feature representation and unstable temporal modeling in video object segmentation (VOS), this paper proposes SCOPE, a VOS framework that replaces Cutie's original encoder with SAM2's pretrained ViT encoder to enrich spatial-semantic features, introduces a flow-guided, query-based motion prediction module to explicitly model inter-frame motion and improve temporal consistency, and combines Cutie, SAM2, and the modified variant in a multi-model ensemble. By jointly strengthening discriminative feature encoding and explicit motion modeling, SCOPE balances segmentation accuracy with temporal stability. On the MOSEv2 track of the 7th LSVOS Challenge, SCOPE ranks third, empirically supporting the value of robust feature encoding and explicit motion modeling for VOS robustness. Its design principles, namely leveraging foundation-model features, incorporating geometric priors via optical flow, and ensembling complementary models, offer generalizable insights for future VOS systems.

📝 Abstract
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
Problem

Research questions and friction points this paper is trying to address.

Enhancing feature representation and temporal modeling in video object segmentation
Integrating complementary strengths of Cutie and SAM2 for improved performance
Addressing limitations in feature capacity and motion prediction for VOS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaced Cutie's encoder with SAM2's pretrained ViT encoder
Added a flow-guided motion prediction module for temporal stability
Ensembled Cutie, SAM2, and the modified variant
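The ensemble step can be sketched as fusing per-pixel foreground probabilities from the constituent models before thresholding. The sketch below is a minimal NumPy illustration under assumed details (weighted averaging, a 0.5 threshold, and placeholder model outputs); the paper's actual ensemble procedure may differ and is defined in the linked repository.

```python
import numpy as np

def ensemble_masks(prob_maps, weights=None, threshold=0.5):
    """Fuse per-pixel foreground probabilities from several VOS models.

    prob_maps: list of HxW arrays in [0, 1], one per model
    weights:   optional per-model weights (default: uniform)
    Returns a binary HxW mask.
    """
    stack = np.stack(prob_maps, axis=0)               # (M, H, W)
    if weights is None:
        weights = np.full(len(prob_maps), 1.0 / len(prob_maps))
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()             # normalize to sum to 1
    fused = np.tensordot(weights, stack, axes=1)      # weighted mean, (H, W)
    return fused >= threshold

# Hypothetical probability maps standing in for Cutie, SAM2, and the variant
h, w = 4, 4
cutie_prob = np.full((h, w), 0.6)
sam2_prob = np.full((h, w), 0.4)
variant_prob = np.full((h, w), 0.7)
mask = ensemble_masks([cutie_prob, sam2_prob, variant_prob])
```

Uniform averaging is the simplest choice; per-model weights would let a stronger model (e.g. the modified variant) dominate the fused prediction.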
Chang Soo Lim
Computer Vision Lab., Department of Computer Science, Hanyang University, Seoul, South Korea
Joonyoung Moon
Computer Vision Lab., Department of Computer Science, Hanyang University, Seoul, South Korea
Donghyeon Cho
Associate Professor, Computer Science, Hanyang University
Computer Vision · Image Processing · Deep Learning