Referring Video Object Segmentation with Cross-Modality Proxy Queries

📅 2025-11-26
🤖 AI Summary
In referring video object segmentation (RVOS), existing methods suffer from unstable tracking and semantic drift due to (i) insufficient inter-frame modeling in conditional queries and (ii) delayed integration of textual constraints. To address these issues, we propose ProxyFormer, a novel architecture featuring evolvable cross-modal proxy queries that explicitly model temporal dependencies and enable early, tight alignment between text and video features. We further design a spatiotemporally decoupled Transformer with staged feature encoding and introduce Joint Semantic Consistency (JSC) training to enforce coherent semantics across frames. Evaluated on four mainstream RVOS benchmarks, ProxyFormer achieves significant improvements in both segmentation accuracy and tracking coherence while incurring lower computational overhead. Notably, it is the first framework to jointly optimize semantic fidelity and temporal stability within a unified architecture.

📝 Abstract
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response mechanism built upon a Transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features to focus on non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of the video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer over state-of-the-art methods.
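The decoupled cross-modality interaction described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, assuming single-head scaled dot-product attention and mean pooling; all function and variable names are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def proxy_update(proxies, video, text):
    """One illustrative proxy-query update with decoupled interactions.

    proxies: (Q, C) proxy queries
    video:   (T, S, C) per-frame spatial tokens
    text:    (N, C) text tokens
    """
    # Early text integration: fuse language semantics into the proxies.
    proxies = proxies + attend(proxies, text, text)
    # Temporal interaction: attend over spatially pooled frame features (T keys).
    frames = video.mean(axis=1)
    proxies = proxies + attend(proxies, frames, frames)
    # Spatial interaction: attend within each frame (S keys), then average,
    # instead of joint attention over all T*S tokens at once.
    spatial = np.stack([attend(proxies, f, f) for f in video]).mean(axis=0)
    return proxies + spatial

rng = np.random.default_rng(0)
P = proxy_update(rng.normal(size=(4, 16)),   # 4 proxy queries, dim 16
                 rng.normal(size=(5, 9, 16)),  # 5 frames x 9 spatial tokens
                 rng.normal(size=(7, 16)))     # 7 text tokens
print(P.shape)  # (4, 16)
```

The point of the decoupling is that each attention call scans either T frame keys or S spatial keys, rather than a joint T*S token set, which is where the lower computational overhead would come from.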
Problem

Research questions and friction points this paper is trying to address.

Tracking target objects across video frames with significant variations
Integrating textual constraints early to focus on referred objects
Establishing inter-frame dependencies for accurate object segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proxy queries integrate visual and text semantics
Decouple cross-modality interactions into temporal and spatial dimensions
Joint Semantic Consistency aligns proxy queries with video-text pairs
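The Joint Semantic Consistency idea, aligning proxy-query semantics with the combined video-text representation, might be sketched as a cosine-alignment objective. This is purely illustrative: the paper's actual loss formulation is not given here, and the additive video-text fusion is an assumption:

```python
import numpy as np

def _unit(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def jsc_loss(proxy_queries, video_feat, text_feat):
    """Cosine-distance consistency between pooled proxy queries and a
    fused video-text embedding (additive fusion is an assumption).

    proxy_queries: (Q, C); video_feat, text_feat: (C,)
    """
    p = _unit(proxy_queries.mean(axis=0))  # pooled proxy semantics
    j = _unit(video_feat + text_feat)      # joint video-text semantics
    return 1.0 - float(p @ j)              # 0 when aligned, up to 2 when opposed

v = np.array([1.0, 0.0, 0.0])
print(round(jsc_loss(np.tile(v, (3, 1)), v, v), 6))  # 0.0
```

Minimizing such a term during training would pull the evolving proxy queries toward the shared video-text semantics, which matches the stated goal of enforcing coherent semantics across frames.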
👥 Authors
Baoli Sun, Dalian University of Technology (fine-grained video action recognition)
Xinzhu Ma, Associate Professor, Beihang University (deep learning, computer vision, 3D scene understanding, AI4Science)
Ning Wang, DUT-RU International School of Information Science & Engineering, Dalian University of Technology, China
Zhihui Wang, DUT-RU International School of Information Science & Engineering, Dalian University of Technology, China
Zhiyong Wang, University of Sydney, Australia