Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Video Object Segmentation (RVOS) faces two key challenges: weak semantic alignment between language and visual content, and temporal inconsistency across frames. Existing "localize-then-segment" two-stage paradigms suffer from information bottlenecks and from decoupling cross-modal grounding from the segmentation step. This paper reformulates RVOS as a language-conditioned continuous flow generation problem, the first such formulation, and proposes an end-to-end framework based on Flow Matching. Leveraging a pre-trained text-to-video (T2V) model, the approach enables fine-grained, pixel-level language guidance and directly generates object masks from video latent representations, unifying semantic alignment and temporal modeling. Crucially, it eliminates reliance on geometric prompts and enhances inter-frame consistency. On MeViS, the method achieves 51.1 J&F (+1.6 over prior SOTA); under zero-shot transfer to Ref-DAVIS17, it attains 73.3 J&F, establishing new state-of-the-art results across multiple benchmarks.

📝 Abstract
Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and to segment them continuously through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic `locate-then-segment' pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency because the segmentation process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of conventionally generating from noise to mask or directly predicting the mask, we reformulate the task as learning a direct, language-guided deformation from a video's holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks, reaching a $\mathcal{J}\&\mathcal{F}$ of 51.1 on MeViS (+1.6 over prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
Problem

Research questions and friction points this paper is trying to address.

Segmenting video objects using natural language descriptions
Anchoring linguistic concepts to pixels across video dynamics
Maintaining temporal consistency in language-guided video segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulating RVOS as conditional continuous flow problem
Learning language-guided deformation from video to mask
Leveraging pretrained T2V models for temporal coherence
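The innovations above center on flow matching between a video representation and its target mask rather than between noise and a mask. A minimal sketch of that objective, assuming a linear probability path (the paper's actual parameterization, conditioning, and latent spaces are not specified here; all names such as `velocity_model` and the toy latents are illustrative):

```python
import numpy as np

def linear_interpolant(x0, x1, t):
    """Point on the straight path from video latent x0 to mask latent x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Velocity of the linear path, d x_t / d t = x1 - x0 (constant in t)."""
    return x1 - x0

def flow_matching_loss(velocity_model, x0, x1, cond, t):
    """MSE between the model's predicted velocity and the path velocity at time t."""
    x_t = linear_interpolant(x0, x1, t)
    v_pred = velocity_model(x_t, t, cond)   # language conditioning enters here
    v_true = target_velocity(x0, x1)
    return float(np.mean((v_pred - v_true) ** 2))

# Toy check: an oracle that returns the true path velocity incurs zero loss.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))   # stand-in for a video latent
x1 = rng.normal(size=(4, 8))   # stand-in for the target mask latent
oracle = lambda x_t, t, cond: x1 - x0
loss = flow_matching_loss(oracle, x0, x1, cond=None, t=0.3)
```

The key design choice this sketch mirrors is that `x0` is the video's own representation, not Gaussian noise, so the learned flow is a deformation of video content into a mask.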
👥 Authors
Zanyi Wang (SGIT AI Lab, State Grid Corporation of China; University of California, San Diego)
Dengyang Jiang (Northwestern Polytechnical University)
Liuzhuozheng Li (SGIT AI Lab, State Grid Corporation of China; The University of Tokyo)
Sizhe Dang (Xi'an Jiaotong University)
Chengzu Li (University of Cambridge)
Harry Yang (HKUST)
Guang Dai (SGIT AI Lab, State Grid Corporation of China)
Mengmeng Wang (SGIT AI Lab, State Grid Corporation of China; Zhejiang University of Technology)
Jingdong Wang (SGIT AI Lab, State Grid Corporation of China; Baidu)