🤖 AI Summary
Conventional 1D RoPE, when applied to video, represents the temporal dimension inadequately, leaving it susceptible to distractors and unable to capture complex spatiotemporal structure. Method: the authors systematically analyze four fundamental properties that a video RoPE should satisfy; introduce V-NIAH-D, a benchmark for evaluating video positional encodings under periodic interference; and present VideoRoPE, a 3D decoupled spatiotemporal rotary positional encoding featuring low-frequency temporal allocation, a diagonal spatial layout, and an adjustable temporal stride to suppress periodic oscillation. Contribution/Results: Experiments show that VideoRoPE consistently outperforms existing RoPE variants on long-video retrieval, video understanding, and video hallucination tasks, with notable gains in robustness to distractors and in long-range temporal modeling.
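To make the "decoupled spatiotemporal indexing with an adjustable temporal stride" idea concrete, here is a minimal sketch of how video tokens could be assigned separate (t, y, x) position triples, with the temporal axis scaled independently of the spatial axes. The function name, argument names, and the stride value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def video_positions(num_frames, height, width, t_stride=2.0):
    """Hypothetical sketch: assign each video token a (t, y, x) position
    triple, scaling the temporal index by an adjustable stride so that
    temporal spacing is decoupled from spatial spacing.
    All names and the default stride are assumptions for illustration."""
    pos = []
    for t in range(num_frames):
        for y in range(height):
            for x in range(width):
                # temporal coordinate stretched by t_stride; spatial
                # coordinates kept on the unit grid
                pos.append((t * t_stride, y, x))
    return np.array(pos)
```

Each of the three coordinates would then drive its own subset of rotary frequency channels, rather than flattening the video into a single 1D token index.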
📄 Abstract
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce **VideoRoPE**, with a *3D structure* designed to preserve spatio-temporal relationships. VideoRoPE features *low-frequency temporal allocation* to mitigate periodic oscillations, a *diagonal layout* to maintain spatial symmetry, and *adjustable temporal spacing* to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at [https://github.com/Wiselnn570/VideoRoPE](https://github.com/Wiselnn570/VideoRoPE).
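For readers unfamiliar with the 1D RoPE being extended here, the following sketch applies the standard rotary embedding to a single vector and checks its defining property: the inner product of two rotated vectors depends only on their relative position. This is generic textbook RoPE, not the paper's 3D variant; the function name and base value are illustrative.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary position embedding applied to one vector.

    Pairs of dimensions (2i, 2i+1) are rotated by angle pos * base^(-2i/d),
    so low-index pairs rotate fast and high-index pairs rotate slowly.
    Illustrative sketch only; not the paper's VideoRoPE implementation.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    # one rotation frequency per dimension pair
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    # 2D rotation of each (x1, x2) pair
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The key property, which VideoRoPE preserves per axis, is relative encoding: `dot(rope_1d(q, m), rope_1d(k, n))` depends only on `m - n`, so attention scores are translation-invariant along the position axis.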