🤖 AI Summary
Conventional 1D RoPE, when applied to video, represents the temporal dimension inadequately, leaving it susceptible to distractors and unable to capture complex spatiotemporal structure. Method: the authors systematically analyze four fundamental properties that a video RoPE should satisfy; introduce V-NIAH-D, a benchmark for evaluating video positional encodings under periodic interference; and present VideoRoPE, a 3D decoupled spatiotemporal rotary positional encoding featuring low-frequency temporal allocation, a diagonal spatial layout, and an adjustable temporal stride to suppress periodic oscillation. Contribution/Results: Experiments show that VideoRoPE consistently outperforms existing RoPE variants on long-video retrieval, video understanding, and video hallucination tasks, with notable gains in robustness to distractors and in long-range temporal modeling.
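To make the "decoupled spatiotemporal indexing with an adjustable temporal stride" idea concrete, here is a minimal sketch of how video tokens could be assigned separate (t, y, x) position triples, with the temporal axis scaled independently of the spatial axes. The function name, argument names, and the stride value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def video_positions(num_frames, height, width, t_stride=2.0):
    """Hypothetical sketch: assign each video token a (t, y, x) position
    triple, scaling the temporal index by an adjustable stride so that
    temporal spacing is decoupled from spatial spacing.
    All names and the default stride are assumptions for illustration."""
    pos = []
    for t in range(num_frames):
        for y in range(height):
            for x in range(width):
                # temporal coordinate stretched by t_stride; spatial
                # coordinates kept on the unit grid
                pos.append((t * t_stride, y, x))
    return np.array(pos)
```

Each of the three coordinates would then drive its own subset of rotary frequency channels, rather than flattening the video into a single 1D token index.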
📄 Abstract
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce **VideoRoPE**, with a *3D structure* designed to preserve spatio-temporal relationships. VideoRoPE features *low-frequency temporal allocation* to mitigate periodic oscillations, a *diagonal layout* to maintain spatial symmetry, and *adjustable temporal spacing* to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at [https://github.com/Wiselnn570/VideoRoPE](https://github.com/Wiselnn570/VideoRoPE).
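For readers unfamiliar with the 1D RoPE being extended here, the following sketch applies the standard rotary embedding to a single vector and checks its defining property: the inner product of two rotated vectors depends only on their relative position. This is generic textbook RoPE, not the paper's 3D variant; the function name and base value are illustrative.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary position embedding applied to one vector.

    Pairs of dimensions (2i, 2i+1) are rotated by angle pos * base^(-2i/d),
    so low-index pairs rotate fast and high-index pairs rotate slowly.
    Illustrative sketch only; not the paper's VideoRoPE implementation.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE rotates dimensions in pairs"
    # one rotation frequency per dimension pair
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    # 2D rotation of each (x1, x2) pair
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out
```

The key property, which VideoRoPE preserves per axis, is relative encoding: `dot(rope_1d(q, m), rope_1d(k, n))` depends only on `m - n`, so attention scores are translation-invariant along the position axis.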