LTCA: Long-range Temporal Context Attention for Referring Video Object Segmentation

📅 2025-10-09
🏛️ IEEE Transactions on Circuits and Systems for Video Technology (Print)
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Video Object Segmentation (RVOS) requires modeling long-range temporal context to accurately localize dynamic objects described by natural language, yet existing methods struggle to balance local and global contextual modeling against computational efficiency. This paper proposes a Long-range Temporal Context Attention (LTCA) mechanism: it employs cross-frame dilated window attention for efficient local modeling, and integrates random global key sampling and an explicit global query to directly encode video-level temporal context, enhancing global awareness without increasing complexity. By avoiding full-frame attention and dense layer stacking, the method significantly improves inference efficiency on long videos. Evaluated on four major benchmarks, including MeViS, the approach achieves new state-of-the-art performance, with improvements of 11.3% and 8.3% on the MeViS val_u and val sets, respectively.

📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment objects in videos given linguistic expressions. The key to solving RVOS is to extract long-range temporal context from the interactions between expressions and videos to depict the dynamic attributes of each object. Previous works either adopt attention across all frames or stack dense local attentions to achieve a global view of temporal context. However, they fail to strike a good balance between locality and globality, and their computational complexity increases significantly with video length. In this paper, we propose an effective long-range temporal context attention (LTCA) mechanism to aggregate global context information into object features. Specifically, we aggregate global context from two aspects. First, we stack sparse local attentions to balance locality and globality: we design a dilated window attention across frames to aggregate local context and perform such attention in a stack of layers to enable a global view, and we further allow each query to attend to a small group of keys randomly selected from a global pool to enhance globality. Second, we design a global query that interacts with all other queries to directly encode global context. Experiments show our method achieves new state-of-the-art results on four referring video segmentation benchmarks. Notably, it improves by 11.3% and 8.3% on the MeViS val_u and val sets, respectively.
Problem

Research questions and friction points this paper is trying to address.

Extracting long-range temporal context for video object segmentation
Balancing locality and globality in temporal attention mechanisms
Reducing computational complexity in long video sequence processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stack sparse local attentions for locality-globality balance
Use dilated window attention across video frames
Introduce global query to encode context directly
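
The three innovations above combine into one sparse attention pattern. The sketch below is illustrative only: the function name, parameters, and the one-query-per-frame simplification are assumptions, not the paper's implementation. It builds a boolean attention mask combining a cross-frame dilated window, a few randomly sampled global keys per query, and a single explicit global query:

```python
import numpy as np

def ltca_mask(num_frames, window=2, dilation=2, num_random=2, seed=0):
    """Sketch of an LTCA-style sparse attention mask (illustrative names).
    Token 0 is the global query; tokens 1..num_frames are per-frame queries.
    mask[i, j] == True means token i may attend to token j."""
    rng = np.random.default_rng(seed)
    n = num_frames + 1  # +1 for the global query token
    mask = np.zeros((n, n), dtype=bool)
    mask[0, :] = True  # global query attends to every token
    mask[:, 0] = True  # every query can read the global query back
    for q in range(1, n):
        frame = q - 1
        # dilated window across frames, e.g. frames {t-4, t-2, t, t+2, t+4}
        for off in range(-window, window + 1):
            t = frame + off * dilation
            if 0 <= t < num_frames:
                mask[q, t + 1] = True
        # random global keys sampled from the full video
        picks = rng.choice(num_frames, size=min(num_random, num_frames),
                           replace=False)
        for t in picks:
            mask[q, t + 1] = True
    return mask

mask = ltca_mask(num_frames=8)
print(mask.astype(int))
```

Stacking layers of such masked attention lets local windows propagate information globally, while the random keys and the global query shortcut that propagation; per-query cost stays O(window + num_random) instead of growing with video length.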
Cilin Yan
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
Jingyun Wang
School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
Guoliang Kang
Professor, Beihang University