Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Video Object Segmentation (RVOS) faces two key challenges: weak semantic alignment between language and visual content, and temporal inconsistency across frames. Existing "localize-then-segment" two-stage paradigms suffer from information bottlenecks and from decoupling cross-modal grounding from the segmentation step. This paper reformulates RVOS as a language-conditioned continuous flow generation problem, the first such formulation, and proposes an end-to-end framework based on Flow Matching. Leveraging a pre-trained text-to-video (T2V) model, the approach enables fine-grained, pixel-level language guidance and directly generates object masks from video latent representations, unifying semantic alignment and temporal modeling. Crucially, it eliminates reliance on geometric prompts and enhances inter-frame consistency. On MeViS, the method achieves 51.1 J&F (+1.6 over prior SOTA); under zero-shot transfer to Ref-DAVIS17, it attains 73.3 J&F, establishing new state-of-the-art results across multiple benchmarks.

📝 Abstract
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and to segment them continuously through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency because the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating from noise to mask or directly predicting the mask, we reformulate the task as learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks, reaching a $\mathcal{J}\&\mathcal{F}$ of 51.1 on MeViS (+1.6 over prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
Problem

Research questions and friction points this paper is trying to address.

Segmenting video objects using natural language descriptions
Anchoring linguistic concepts to pixels across video dynamics
Maintaining temporal consistency in language-guided video segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulating RVOS as conditional continuous flow problem
Learning language-guided deformation from video to mask
Leveraging pretrained T2V models for temporal coherence
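The innovations above center on flow matching between a video representation and its target mask rather than between noise and a mask. A minimal sketch of that objective, assuming a linear probability path (the paper's actual parameterization, conditioning, and latent spaces are not specified here; all names such as `velocity_model` and the toy latents are illustrative):

```python
import numpy as np

def linear_interpolant(x0, x1, t):
    """Point on the straight path from video latent x0 to mask latent x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the linear path, d x_t / d t = x1 - x0 (constant in t)."""
    return x1 - x0

def flow_matching_loss(velocity_model, x0, x1, cond, t):
    """MSE between the model's predicted velocity and the path velocity at time t."""
    x_t = linear_interpolant(x0, x1, t)
    v_pred = velocity_model(x_t, t, cond)   # language conditioning enters here
    v_true = target_velocity(x0, x1)
    return float(np.mean((v_pred - v_true) ** 2))

# Toy check: an oracle that returns the true path velocity incurs zero loss.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # stand-in for a video latent
x1 = rng.normal(size=(4, 8))   # stand-in for the target mask latent
oracle = lambda x_t, t, cond: x1 - x0
loss = flow_matching_loss(oracle, x0, x1, cond=None, t=0.3)
```

The key design choice this sketch mirrors is that `x0` is the video's own representation, not Gaussian noise, so the learned flow is a deformation of video content into a mask.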
👥 Authors
Zanyi Wang (SGIT AI Lab, State Grid Corporation of China; University of California, San Diego)
Dengyang Jiang (Northwestern Polytechnical University)
Liuzhuozheng Li (SGIT AI Lab, State Grid Corporation of China; The University of Tokyo)
Sizhe Dang (Xi'an Jiaotong University)
Chengzu Li (University of Cambridge)
Harry Yang (HKUST)
Guang Dai (SGIT AI Lab, State Grid Corporation of China)
Mengmeng Wang (SGIT AI Lab, State Grid Corporation of China; Zhejiang University of Technology)
Jingdong Wang (SGIT AI Lab, State Grid Corporation of China; Baidu)