Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

📅 2025-09-19
🤖 AI Summary
To address weak feature representation and unstable temporal modeling in video object segmentation (VOS), this paper proposes SCOPE, a VOS framework that replaces Cutie's original encoder with SAM2's pretrained ViT encoder to enrich spatial-semantic features, introduces a flow-guided, query-based motion prediction module to explicitly model inter-frame motion and improve temporal consistency, and combines Cutie, SAM2, and the modified variant in a multi-model ensemble. By jointly strengthening discriminative feature encoding and explicit motion modeling, SCOPE balances segmentation accuracy with temporal stability. On the MOSEv2 track of the 7th LSVOS Challenge, SCOPE ranks third, empirically supporting the value of robust feature encoding and explicit motion modeling for VOS robustness. Its design principles, namely leveraging foundation-model features, incorporating geometric priors via optical flow, and ensembling complementary models, offer generalizable insights for future VOS systems.

📝 Abstract
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
Problem

Research questions and friction points this paper is trying to address.

Enhancing feature representation and temporal modeling in video object segmentation
Integrating complementary strengths of Cutie and SAM2 for improved performance
Addressing limitations in feature capacity and motion prediction for VOS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaced Cutie's encoder with SAM2's pretrained ViT encoder
Added a flow-guided motion prediction module for temporal stability
Ensembled Cutie, SAM2, and the modified variant
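The ensemble step can be sketched as fusing per-pixel foreground probabilities from the constituent models before thresholding. The sketch below is a minimal NumPy illustration under assumed details (weighted averaging, a 0.5 threshold, and placeholder model outputs); the paper's actual ensemble procedure may differ and is defined in the linked repository.

```python
import numpy as np

def ensemble_masks(prob_maps, weights=None, threshold=0.5):
    """Fuse per-pixel foreground probabilities from several VOS models.

    prob_maps: list of HxW arrays in [0, 1], one per model
    weights:   optional per-model weights (default: uniform)
    Returns a binary HxW mask.
    """
    stack = np.stack(prob_maps, axis=0)               # (M, H, W)
    if weights is None:
        weights = np.full(len(prob_maps), 1.0 / len(prob_maps))
    else:
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()             # normalize to sum to 1
    fused = np.tensordot(weights, stack, axes=1)      # weighted mean, (H, W)
    return fused >= threshold

# Hypothetical probability maps standing in for Cutie, SAM2, and the variant
h, w = 4, 4
cutie_prob = np.full((h, w), 0.6)
sam2_prob = np.full((h, w), 0.4)
variant_prob = np.full((h, w), 0.7)
mask = ensemble_masks([cutie_prob, sam2_prob, variant_prob])
```

Uniform averaging is the simplest choice; per-model weights would let a stronger model (e.g. the modified variant) dominate the fused prediction.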
Chang Soo Lim
Computer Vision Lab., Department of Computer Science, Hanyang University, Seoul, South Korea
Joonyoung Moon
Computer Vision Lab., Department of Computer Science, Hanyang University, Seoul, South Korea
Donghyeon Cho
Associate Professor, Computer Science, Hanyang University
Computer Vision · Image Processing · Deep Learning