ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RVOS methods struggle to accurately model fine-grained alignment between complex textual descriptions and spatiotemporal video features. To address this, the paper proposes an efficient adaptation of visual grounding foundation models to the RVOS task, introducing three core innovations: (1) an Object-Consistent Temporal Enhancer that enforces semantic stability of target objects across frames; (2) a Grounding-Guided Deformable Mask Decoder for text-driven, pixel-precise segmentation; and (3) a Confidence-Aware Query Pruning strategy that dynamically improves decoding efficiency. The method combines vision-language joint pretraining, a deformable-DETR-style architecture, and temporal feature propagation. Evaluated on five mainstream RVOS benchmarks, it consistently outperforms state-of-the-art methods, achieving a superior trade-off between segmentation accuracy and inference efficiency.
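To make the query-pruning idea concrete, here is a minimal sketch of confidence-aware pruning: each decoder query gets a confidence score, and only queries above a threshold are decoded further. All names, shapes, and the threshold/fallback logic below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def confidence_aware_query_pruning(queries, confidences, threshold=0.3, min_keep=1):
    """Keep only object queries whose confidence exceeds a threshold.

    queries:     (N, D) array of decoder query embeddings (hypothetical shape).
    confidences: (N,) per-query confidence scores in [0, 1].
    min_keep:    always retain at least this many top-scoring queries,
                 so the decoder never runs on an empty query set.
    """
    keep = confidences >= threshold
    if keep.sum() < min_keep:
        keep = np.zeros_like(keep)
        keep[np.argsort(confidences)[-min_keep:]] = True
    return queries[keep], confidences[keep]

# Toy example: 5 queries with 4-dim embeddings; 3 survive the 0.3 threshold.
q = np.random.randn(5, 4)
c = np.array([0.9, 0.1, 0.5, 0.05, 0.7])
pruned_q, pruned_c = confidence_aware_query_pruning(q, c, threshold=0.3)
print(pruned_q.shape)  # (3, 4)
```

The efficiency gain comes from the decoder processing only the surviving queries per frame instead of the full fixed-size query set.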

📝 Abstract
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. Despite notable progress in recent years, current RVOS models still struggle to handle complicated object descriptions due to their limited video-language understanding. To address this limitation, we present ReferDINO, an end-to-end RVOS model that inherits strong vision-language understanding from pretrained visual grounding foundation models, and is further endowed with effective temporal understanding and object segmentation capabilities. In ReferDINO, we contribute three technical innovations for effectively adapting the foundation models to RVOS: 1) an object-consistent temporal enhancer that capitalizes on the pretrained object-text representations to enhance temporal understanding and object consistency; 2) a grounding-guided deformable mask decoder that integrates text and grounding conditions to generate accurate object masks; 3) a confidence-aware query pruning strategy that significantly improves object decoding efficiency without compromising performance. We conduct extensive experiments on five public RVOS benchmarks to demonstrate that ReferDINO significantly outperforms state-of-the-art methods. Project page: https://isee-laboratory.github.io/ReferDINO
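The object-consistent temporal enhancer stabilizes a target object's representation across frames. As a rough intuition only, the sketch below smooths per-frame object query embeddings with a running average; the actual module is attention-based and conditions on the pretrained object-text representations, so every name and the smoothing scheme here are simplifying assumptions.

```python
import numpy as np

def temporal_enhance(query_per_frame, momentum=0.6):
    """Smooth per-frame embeddings of one object with an exponential
    moving average -- a minimal stand-in for enforcing temporal
    consistency of object features (not the paper's actual design).

    query_per_frame: (T, D) embeddings of the same object across T frames.
    """
    out = np.empty_like(query_per_frame)
    state = query_per_frame[0]  # initialize from the first frame
    for t, q in enumerate(query_per_frame):
        state = momentum * state + (1 - momentum) * q  # blend history with current frame
        out[t] = state
    return out
```

The design intuition is the same as in the paper: a per-frame prediction alone can flicker between objects, so each frame's object feature is informed by the other frames before decoding a mask.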
Problem

Research questions and friction points this paper is trying to address.

Video Object Segmentation
Complex Text Descriptions
Video-Text Correlation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Consistency Enhancement
Text-guided Decoding for Object Segmentation
Query Optimization for Efficiency