Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing referring video object segmentation (RVOS) methods suffer from coarse-grained segmentation heads, weak semantic alignment between language and visual features, and low boundary precision. Method: We propose Temporal-Conditional Segmentation (TC-Seg), a novel framework featuring: (1) coupling with a text-to-video diffusion model for robust cross-modal feature extraction—removing the noise prediction module to enhance deterministic semantic representation; (2) a lightweight Temporal Context Mask Refinement (TCMR) module that explicitly models inter-frame temporal dependencies and refines mask boundaries; and (3) a restructured segmentation head enabling text-guided, temporally adaptive segmentation. Contribution/Results: TC-Seg achieves state-of-the-art performance across four major RVOS benchmarks, significantly improving both segmentation accuracy and boundary quality. Comprehensive ablation studies and cross-dataset evaluations demonstrate its effectiveness, robustness, and strong generalization capability.

📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling while relatively neglecting the design of the segmentation head; in fact, there remains considerable room for improvement there. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to prevent noise randomness from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing boundary segmentation in referring video object segmentation
Reducing noise-induced accuracy degradation in diffusion models
Improving feature extraction beyond VAE limitations for segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates existing segmentation methods for boundaries
Uses noise-free diffusion model for feature extraction
Adds Temporal Context Mask Refinement module
Ruixin Zhang
Tencent
computer vision
Jiaqing Fan
School of Computer Science and Technology, Soochow University
Yifan Liao
School of Computer Science and Technology, Soochow University
Qian Qiao
School of Computer Science and Technology, Soochow University
Fanzhang Li
School of Computer Science and Technology, Soochow University