Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two key challenges in referring video object segmentation (RVOS): (i) ambiguous target identification, especially in scenes with multiple visually similar objects, and (ii) inconsistent mask propagation across frames. To this end, we propose FindTrack, a decoupled framework that separates target identification from mask propagation. It first identifies the target on an adaptively selected key frame, chosen by balancing segmentation confidence and vision-text alignment; a dedicated propagation module then uses this key frame as the reference to track and segment the object across the rest of the video. The method comprises four core components: multimodal feature disentanglement, alignment-aware keyframe selection, a reference-guided propagation network, and joint optimization. Experiments show that FindTrack outperforms existing methods on public benchmarks, with gains in segmentation accuracy and tracking stability, particularly in scenes containing multiple similar objects.

📝 Abstract

Referring video object segmentation aims to segment and track a target object in a video using a natural language prompt. Existing methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information together to generate per-frame masks. However, this approach often struggles with ambiguous target identification, particularly in scenes with multiple similar objects, and fails to ensure consistent mask propagation across frames. To address these limitations, we introduce FindTrack, a novel decoupled framework that separates target identification from mask propagation. FindTrack first adaptively selects a key frame by balancing segmentation confidence and vision-text alignment, establishing a robust reference for the target object. This reference is then utilized by a dedicated propagation module to track and segment the object across the entire video. By decoupling these processes, FindTrack effectively reduces ambiguities in target association and enhances segmentation consistency. We demonstrate that FindTrack outperforms existing methods on public benchmarks.
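The key-frame selection step described above can be illustrated with a minimal sketch. The abstract only says that selection balances segmentation confidence and vision-text alignment; the linear weighting, the `weight` parameter, the `FrameCandidate` type, and the assumption that both scores lie in [0, 1] are all hypothetical choices made for illustration, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FrameCandidate:
    """Per-frame scores for one candidate mask (all fields are illustrative)."""
    frame_index: int
    seg_confidence: float   # assumed in [0, 1], e.g. a mean mask probability
    text_alignment: float   # assumed in [0, 1], e.g. a rescaled vision-text similarity


def select_key_frame(candidates: Sequence[FrameCandidate], weight: float = 0.5) -> int:
    """Return the index of the frame with the highest combined score.

    `weight` is a hypothetical knob: 1.0 uses segmentation confidence only,
    0.0 uses vision-text alignment only.
    """
    if not candidates:
        raise ValueError("at least one candidate frame is required")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    def combined(c: FrameCandidate) -> float:
        # Assumed linear trade-off between the two signals.
        return weight * c.seg_confidence + (1.0 - weight) * c.text_alignment

    return max(candidates, key=combined).frame_index
```

With equal weighting, a frame that scores moderately well on both signals can beat a frame that is confident but poorly aligned with the text, which is the kind of trade-off the abstract alludes to.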

Problem

Research questions and friction points this paper is trying to address.

Existing methods fuse visual and textual features in a highly entangled manner

Target identification is ambiguous in scenes with multiple similar objects

Mask propagation is inconsistent across frames

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples target identification from mask propagation

Adaptively selects a key frame by balancing segmentation confidence and vision-text alignment

Propagates the key-frame reference with a dedicated module to keep masks consistent across video frames
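As a rough illustration of the "find first, track next" control flow, the sketch below propagates a key-frame mask outward through the video. It is written under stated assumptions: the `step` callable stands in for the propagation module and its `(frame, previous_mask)` signature is hypothetical, and propagating both forward and backward from the key frame is an assumed strategy that the summary does not confirm.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")
Mask = TypeVar("Mask")


def propagate_from_key_frame(
    frames: Sequence[Frame],
    key_index: int,
    key_mask: Mask,
    step: Callable[[Frame, Mask], Mask],
) -> List[Optional[Mask]]:
    """Fill in one mask per frame, starting from the key-frame mask.

    `step(frame, previous_mask)` is a hypothetical stand-in for a
    propagation module; it never sees the text prompt, which keeps
    identification and propagation decoupled.
    """
    if not 0 <= key_index < len(frames):
        raise IndexError("key_index is outside the video")

    masks: List[Optional[Mask]] = [None] * len(frames)
    masks[key_index] = key_mask

    # Forward pass: key frame -> last frame.
    for t in range(key_index + 1, len(frames)):
        masks[t] = step(frames[t], masks[t - 1])

    # Backward pass: key frame -> first frame (assumed, not from the source).
    for t in range(key_index - 1, -1, -1):
        masks[t] = step(frames[t], masks[t + 1])

    return masks
```

The target is identified once, on the key frame, and every other mask is derived from that reference rather than from a fresh per-frame vision-text fusion.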

🔎 Similar Papers

Context-Aware Video Instance Segmentation