Referring Video Object Segmentation with Cross-Modality Proxy Queries

📅 2025-11-26
🤖 AI Summary
In referring video object segmentation (RVOS), existing methods suffer from unstable tracking and semantic drift due to (i) insufficient inter-frame modeling in conditional queries and (ii) delayed integration of textual constraints. To address these issues, we propose ProxyFormer, a novel architecture featuring evolvable cross-modal proxy queries that explicitly model temporal dependencies and enable early, tight alignment between text and video features. We further design a spatiotemporally decoupled Transformer with staged feature encoding and introduce Joint Semantic Consistency (JSC) training to enforce coherent semantics across frames. Evaluated on four mainstream RVOS benchmarks, ProxyFormer achieves significant improvements in both segmentation accuracy and tracking coherence while incurring lower computational overhead. Notably, it is the first framework to jointly optimize semantic fidelity and temporal stability within a unified architecture.

📝 Abstract
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response mechanism built upon a Transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over state-of-the-art methods.
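The decoupled cross-modality interaction described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, assuming single-head scaled dot-product attention and mean pooling; all function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def proxy_update(proxies, video, text):
    """One illustrative proxy-query update with decoupled interactions.

    proxies: (Q, C) proxy queries
    video:   (T, S, C) per-frame spatial tokens
    text:    (N, C) text tokens
    """
    # Early text integration: fuse language semantics into the proxies.
    proxies = proxies + attend(proxies, text, text)
    # Temporal interaction: attend over spatially pooled frame features (T keys).
    frames = video.mean(axis=1)
    proxies = proxies + attend(proxies, frames, frames)
    # Spatial interaction: attend within each frame (S keys), then average,
    # instead of joint attention over all T*S tokens at once.
    spatial = np.stack([attend(proxies, f, f) for f in video]).mean(axis=0)
    return proxies + spatial

rng = np.random.default_rng(0)
P = proxy_update(rng.normal(size=(4, 16)),   # 4 proxy queries, dim 16
                 rng.normal(size=(5, 9, 16)),  # 5 frames x 9 spatial tokens
                 rng.normal(size=(7, 16)))     # 7 text tokens
print(P.shape)  # (4, 16)
```

The point of the decoupling is that each attention call scans either T frame keys or S spatial keys, rather than a joint T*S token set, which is where the lower computational overhead would come from.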
Problem

Research questions and friction points this paper is trying to address.

Tracking target objects across video frames with significant variations
Integrating textual constraints early to focus on referred objects
Establishing inter-frame dependencies for accurate object segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proxy queries integrate visual and text semantics
Decouple cross-modality interactions into temporal and spatial dimensions
Joint Semantic Consistency aligns proxy queries with video-text pairs
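The Joint Semantic Consistency idea, aligning proxy-query semantics with the combined video-text representation, might be sketched as a cosine-alignment objective. This is purely illustrative: the paper's actual loss formulation is not given here, and the additive video-text fusion is an assumption:

```python
import numpy as np

def _unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def jsc_loss(proxy_queries, video_feat, text_feat):
    """Cosine-distance consistency between pooled proxy queries and a
    fused video-text embedding (additive fusion is an assumption).

    proxy_queries: (Q, C); video_feat, text_feat: (C,)
    """
    p = _unit(proxy_queries.mean(axis=0))  # pooled proxy semantics
    j = _unit(video_feat + text_feat)      # joint video-text semantics
    return 1.0 - float(p @ j)              # 0 when aligned, up to 2 when opposed

v = np.array([1.0, 0.0, 0.0])
print(round(jsc_loss(np.tile(v, (3, 1)), v, v), 6))  # 0.0
```

Minimizing such a term during training would pull the evolving proxy queries toward the shared video-text semantics, which matches the stated goal of enforcing coherent semantics across frames.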
👥 Authors
Baoli Sun, Dalian University of Technology (fine-grained video action recognition)
Xinzhu Ma, Associate Professor, Beihang University (deep learning, computer vision, 3D scene understanding, AI4Science)
Ning Wang, DUT-RU International School of Information Science & Engineering, Dalian University of Technology, China
Zhihui Wang, DUT-RU International School of Information Science & Engineering, Dalian University of Technology, China
Zhiyong Wang, University of Sydney, Australia