Referring Video Object Segmentation via Language-aligned Track Selection

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Video Object Segmentation (RVOS) requires joint modeling of language understanding, visual representation, and motion reasoning. This paper proposes SOLA, the first framework to decouple RVOS into two stages: (1) generating high-consistency mask trajectories using SAM2, and (2) a lightweight selection module that jointly models appearance, spatiotemporal motion, and cross-modal alignment, optimized via contrastive learning for language-trajectory matching. Our core innovations lie in this principled decoupling design and a novel multimodal joint selection mechanism. SOLA achieves state-of-the-art performance on MeViS and demonstrates significant gains under zero-shot transfer to Ref-Youtube-VOS and Ref-DAVIS. Moreover, it exhibits strong robustness against challenging degradations—including noise and motion blur—while maintaining efficiency and generalizability.
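The summary mentions that the selection module is optimized via contrastive learning for language-trajectory matching. A minimal sketch of such a loss is shown below; the function name, tensor shapes, and temperature value are illustrative assumptions, not SOLA's exact implementation:

```python
import torch
import torch.nn.functional as F

def track_selection_loss(track_emb, text_emb, pos_mask, temperature=0.07):
    """Contrastive loss matching a referring expression to candidate mask tracks.

    track_emb: (N, D) embeddings of N candidate tracks
    text_emb:  (D,)   embedding of the referring expression
    pos_mask:  (N,)   boolean, True for tracks labeled as matching the expression
    """
    track_emb = F.normalize(track_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = track_emb @ text_emb / temperature  # (N,) similarity scores
    log_prob = F.log_softmax(logits, dim=0)      # distribution over candidate tracks
    # average negative log-likelihood over the positive tracks
    return -(log_prob[pos_mask]).mean()
```

At inference, the same similarity scores can be used directly: the track(s) with the highest language-track similarity are selected as the segmentation output.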

📝 Abstract
Referring Video Object Segmentation (RVOS) seeks to segment objects throughout a video based on natural language expressions. While existing methods have made strides in vision-language alignment, they often overlook the importance of robust video object tracking, where inconsistent mask tracks can disrupt vision-language alignment, leading to suboptimal performance. In this work, we present Selection by Object Language Alignment (SOLA), a novel framework that reformulates RVOS into two sub-problems: track generation and track selection. In track generation, we leverage a vision foundation model, Segment Anything Model 2 (SAM2), which generates consistent mask tracks across frames, producing reliable candidates for both foreground and background objects. For track selection, we propose a lightweight yet effective selection module that aligns visual and textual features while modeling object appearance and motion within video sequences. This design enables precise motion modeling and vision-language alignment. Our approach achieves state-of-the-art performance on the challenging MeViS dataset and demonstrates superior results in zero-shot settings on the Ref-Youtube-VOS and Ref-DAVIS datasets. Furthermore, SOLA exhibits strong generalization and robustness in corrupted settings, such as those with added Gaussian noise or motion blur. Our project page is available at https://cvlab-kaist.github.io/SOLA
Problem

Research questions and friction points this paper is trying to address.

Track and segment objects using natural language descriptions
Align visual representations with language features effectively
Bridge modality gap between SAM2 and language features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages SAM2 object tokens for video-level representations
Aligns object tokens with language via lightweight module
Uses IoU-based pseudo-labeling to bridge modality gap
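The last bullet describes IoU-based pseudo-labeling for bridging the modality gap: candidate tracks are labeled positive or negative by their video-level mask overlap with the annotated object. A minimal sketch, assuming a simple threshold on IoU computed over all frames (the function name, shapes, and threshold are illustrative, not taken from the paper):

```python
import numpy as np

def iou_pseudo_labels(track_masks, gt_masks, thresh=0.5):
    """Assign positive/negative pseudo-labels to candidate tracks by mask IoU.

    track_masks: (N, T, H, W) binary masks for N candidate tracks over T frames
    gt_masks:    (T, H, W)    binary ground-truth masks of the referred object
    Returns a boolean array of shape (N,): True where video-level IoU >= thresh.
    """
    # intersection/union aggregated over all frames and pixels of each track
    inter = np.logical_and(track_masks, gt_masks).sum(axis=(1, 2, 3))
    union = np.logical_or(track_masks, gt_masks).sum(axis=(1, 2, 3))
    iou = inter / np.maximum(union, 1)  # avoid divide-by-zero for empty masks
    return iou >= thresh
```

These pseudo-labels can then supervise the language-track matching objective without requiring SAM2's visual tokens and the text encoder to share an embedding space a priori.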
Seongchan Kim
Korea University
Woojeong Jin
University of Southern California
Multimodal Learning, Natural Language Processing
Sangbeom Lim
Korea University
Heeji Yoon
Korea University
Hyunwook Choi
Korea University
Seungryong Kim
Associate Professor, KAIST
Computer Vision, Machine Learning