🤖 AI Summary
Semi-supervised video object segmentation (VOS) suffers from limited robustness under drastic appearance changes, occlusions, and scene transitions, largely because methods lack high-level semantic understanding of the target. To address this, the Segment Concept (SeC) framework is the first to integrate large vision-language models (LVLMs) into zero-shot VOS without fine-tuning. SeC builds a concept-driven semantic representation of the target by aligning textual prompts with visual features via an LVLM, enabling concept-guided, temporally consistent segmentation across frames. By modeling the target's intrinsic semantic concept rather than its momentary appearance, SeC substantially improves tracking and segmentation stability in complex dynamic scenes. On the MOSEv2 test set, SeC achieves a J&F score of 39.7 and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge. These results support the effectiveness of a semantics-driven paradigm for zero-shot VOS.
📝 Abstract
Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigates this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved a J&F score of 39.7 on the test set and ranked 2nd in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.
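To make the concept-guided idea concrete, here is a minimal, hypothetical sketch of matching per-pixel visual features against a semantic concept embedding. This is an illustration only, not SeC's actual architecture: the function name, feature shapes, and threshold are assumptions, and in practice the concept embedding would come from an LVLM rather than a toy vector.

```python
import numpy as np

def concept_mask(frame_feats, concept_emb, threshold=0.5):
    """Select pixels whose features align with a target concept embedding.

    frame_feats: (H, W, D) per-pixel feature map from a vision backbone.
    concept_emb: (D,) semantic embedding of the target concept
                 (hypothetically produced by an LVLM from a text prompt).
    Returns a boolean (H, W) mask of concept-consistent pixels.
    """
    # L2-normalize both sides so the dot product is cosine similarity.
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + 1e-8)
    c = concept_emb / (np.linalg.norm(concept_emb) + 1e-8)
    sim = f @ c  # (H, W) cosine similarity to the concept
    return sim > threshold

# Toy 2x2 "frame": two pixels point roughly along the concept direction.
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.7, 0.7], [-1.0, 0.0]]])
concept = np.array([1.0, 0.0])
mask = concept_mask(feats, concept, threshold=0.5)
print(mask)  # → [[ True False] [ True False]]
```

Because the mask depends on semantic similarity rather than frame-to-frame appearance matching, a selection rule of this kind can stay stable through occlusions and appearance changes, which is the intuition behind SeC's concept-driven segmentation.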