🤖 AI Summary
Semi-supervised video object segmentation (VOS) suffers from limited robustness under drastic appearance changes, occlusions, and scene transitions, largely because methods lack high-level semantic understanding of the target. To address this, the Segment Concept (SeC) framework is the first to integrate large vision-language models (LVLMs) into zero-shot VOS without fine-tuning. SeC builds a concept-driven semantic representation of the target by aligning textual prompts with visual features via an LVLM, enabling concept-guided, temporally consistent segmentation across frames. By modeling the target's intrinsic semantic concept rather than its momentary appearance, SeC substantially improves tracking and segmentation stability in complex dynamic scenes. On the MOSEv2 test set, SeC achieves a J&F score of 39.7 and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge. These results support the effectiveness of a semantics-driven paradigm for zero-shot VOS.
📝 Abstract
Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigates this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved a J&F score of 39.7 on the test set and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.
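To make the concept-guided idea concrete, here is a minimal, hypothetical sketch of matching per-pixel visual features against a semantic concept embedding. This is an illustration only, not SeC's actual architecture: the function name, feature shapes, and threshold are assumptions, and in practice the concept embedding would come from an LVLM rather than a toy vector.

```python
import numpy as np

def concept_mask(frame_feats, concept_emb, threshold=0.5):
    """Select pixels whose features align with a target concept embedding.

    frame_feats: (H, W, D) per-pixel feature map from a vision backbone.
    concept_emb: (D,) semantic embedding of the target concept
                 (hypothetically produced by an LVLM from a text prompt).
    Returns a boolean (H, W) mask of concept-consistent pixels.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + 1e-8)
    c = concept_emb / (np.linalg.norm(concept_emb) + 1e-8)
    sim = f @ c  # (H, W) cosine similarity to the concept
    return sim > threshold

# Toy 2x2 "frame": two pixels point roughly along the concept direction.
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.7, 0.7], [-1.0, 0.0]]])
concept = np.array([1.0, 0.0])
mask = concept_mask(feats, concept, threshold=0.5)
print(mask)  # → [[ True False] [ True False]]
```

Because the mask depends on semantic similarity rather than frame-to-frame appearance matching, a selection rule of this kind can stay stable through occlusions and appearance changes, which is the intuition behind SeC's concept-driven segmentation.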