Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

📅 2025-10-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Referring video object segmentation (RVOS) suffers from reliance on dense mask supervision, high computational overhead, and poor generalization. To address these issues, this paper proposes Tenet, a modular framework that decouples RVOS into three subtasks: referring expression understanding, temporal modeling, and pixel-level segmentation. Tenet leverages off-the-shelf detectors and trackers to generate semantically aligned temporal prompts and introduces a prompt preference learning mechanism to automatically evaluate and select high-quality prompts. These prompts then drive an image-level foundation segmentation model to perform video-level inference. Crucially, Tenet eliminates end-to-end mask supervision entirely, significantly improving scalability and deployment efficiency. On multiple RVOS benchmarks, Tenet achieves state-of-the-art or near-state-of-the-art performance at substantially lower training cost, empirically validating the effectiveness and generality of the prompt-driven paradigm for RVOS.

πŸ“ Abstract
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by a query sentence in a video. Most existing methods require end-to-end training with dense mask annotations, which is computationally expensive and less scalable. In this work, we rethink the RVOS problem and investigate the key to this task. Building on existing foundation segmentation models, we decompose RVOS into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts can be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By using such prompts to instruct image-based foundation segmentation models, we can produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
Problem

Research questions and friction points this paper is trying to address.

Adapting image segmentation models to video object segmentation
Identifying high-quality temporal prompts when confidence scores are unreliable
Reducing computational cost of referring video object segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages object detectors and trackers for temporal prompts
Proposes Prompt Preference Learning to evaluate prompt quality
Instructs image-based foundation models for video segmentation
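The prompt-driven pipeline above (detect candidates per frame, link them into temporal prompts, select the best via preference learning, then prompt an image-level segmenter) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all component names (`detect`, `track`, `prefer`, `segment`) are hypothetical stand-ins for an off-the-shelf referring detector, a tracker, the learned Prompt Preference model, and a promptable foundation segmenter.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def tenet_rvos(
    frames: Sequence,
    expression: str,
    detect: Callable,   # (frame, text) -> candidate boxes for the expression
    track: Callable,    # per-frame candidates -> list of box tracks
    prefer: Callable,   # track -> quality score (Prompt Preference Learning)
    segment: Callable,  # (frame, box) -> mask
) -> List:
    # 1. Referring: ground the expression to candidate boxes in each frame.
    candidates = [detect(f, expression) for f in frames]
    # 2. Video: associate boxes across frames into temporal prompts (tracks).
    tracks = track(candidates)
    # 3. Selection: pick the prompt a learned preference model scores highest
    #    (raw detector confidence alone is unreliable for this choice).
    best = max(tracks, key=prefer)
    # 4. Segmentation: prompt an image-level foundation model frame by frame.
    return [segment(f, box) for f, box in zip(frames, best)]

# Toy demo with stand-in components (2 frames, boxes as tuples).
frames = ["frame0", "frame1"]
detect = lambda f, text: [(0, 0, 5, 5), (10, 10, 20, 20)]
track = lambda cands: [list(boxes) for boxes in zip(*cands)]  # one track per slot
prefer = lambda trk: trk[0][2]        # toy score: prefer the wider box track
segment = lambda f, box: (f, box)     # toy "mask": echo frame and prompt box
masks = tenet_rvos(frames, "the red car", detect, track, prefer, segment)
```

In a real instantiation, `segment` would be a promptable image segmentation model taking box prompts, and `prefer` would be the trained preference network; the point of the sketch is only the control flow that decouples referring, temporal modeling, and segmentation.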