Geometry-Aware Rotary Position Embedding for Consistent Video World Model

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video world models struggle to maintain spatial consistency over long temporal horizons, often exhibiting geometric drift and hallucinated details when revisiting scenes. This work proposes ViewRope, a geometry-aware rotary positional encoding that explicitly embeds camera ray directions into the self-attention mechanism of video Transformers. By combining a relative ray parameterization with frame-sparse attention, ViewRope introduces a 3D-consistency inductive bias while improving computational efficiency. Notably, it is the first method to explicitly model ray geometry within the attention computation, which enhances scene consistency over extended trajectories. The approach shows improved fidelity in closed-loop reconstruction, as validated on the ViewBench benchmark.
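The idea of a rotary encoding driven by camera-ray directions can be illustrated with a minimal NumPy sketch. This is not the paper's formulation, which the summary does not spell out: the function names (`ray_directions`, `ray_angles`, `rotary_encode`), the azimuth/elevation parameterization, and the power-of-two frequencies are all assumptions made for illustration. The sketch only shows the generic rotary property that the query-key score depends on the difference between the two tokens' ray angles rather than on their pixel positions.

```python
import numpy as np

def ray_directions(K, R_c2w, pixels):
    """Unit world-space ray directions for pixel coordinates.

    K: (3, 3) intrinsics, R_c2w: (3, 3) camera-to-world rotation,
    pixels: (N, 2) array of (u, v) coordinates.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)   # (N, 3) homogeneous pixels
    cam = homog @ np.linalg.inv(K).T                 # back-project into camera frame
    world = cam @ R_c2w.T                            # rotate into world frame
    return world / np.linalg.norm(world, axis=1, keepdims=True)

def ray_angles(dirs):
    """Azimuth and elevation (radians) of unit directions, shape (N, 2)."""
    azimuth = np.arctan2(dirs[:, 0], dirs[:, 2])
    elevation = np.arcsin(np.clip(dirs[:, 1], -1.0, 1.0))
    return np.stack([azimuth, elevation], axis=1)

def rotary_encode(x, angles):
    """Rotate channel pairs of x by frequency-scaled ray angles.

    x: (N, D) float array with D divisible by 4. The first half of the
    channels is driven by azimuth, the second half by elevation.
    Assumption: integer frequencies 1, 2, 4, ... so each rotation is
    2*pi-periodic in the angle.
    """
    n, d = x.shape
    assert d % 4 == 0, "D must be divisible by 4"
    half = d // 2
    n_freq = half // 2
    freqs = 2.0 ** np.arange(n_freq)                 # (n_freq,)
    out = np.empty_like(x, dtype=float)
    for axis in range(2):                            # 0: azimuth, 1: elevation
        theta = angles[:, axis:axis + 1] * freqs     # (N, n_freq)
        cos, sin = np.cos(theta), np.sin(theta)
        block = x[:, axis * half:(axis + 1) * half]
        x1, x2 = block[:, 0::2], block[:, 1::2]
        rot = np.empty_like(block, dtype=float)
        rot[:, 0::2] = x1 * cos - x2 * sin           # standard 2D rotation per pair
        rot[:, 1::2] = x1 * sin + x2 * cos
        out[:, axis * half:(axis + 1) * half] = rot
    return out
```

In this sketch, queries and keys would each be passed through `rotary_encode` with the angles of their own rays before the dot product, so two tokens that look along the same world-space direction get the same rotation regardless of which frame or pixel they come from.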

📝 Abstract
Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
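The frame-sparse idea in the abstract, attending only to historical frames that are geometrically relevant, can be sketched as follows. The abstract does not state the selection criterion, so this example assumes a simple one: rank past frames by the cosine similarity between their viewing direction and the current one, keep the top `k`, and run softmax attention over tokens from those frames only. The helper names `select_frames` and `sparse_attention` are hypothetical.

```python
import numpy as np

def select_frames(view_dirs, query_dir, k):
    """Indices of the k past frames whose viewing direction best matches query_dir.

    view_dirs: (F, 3) one direction per past frame, query_dir: (3,).
    Assumed criterion: cosine similarity of viewing directions.
    """
    view_dirs = view_dirs / np.linalg.norm(view_dirs, axis=1, keepdims=True)
    query_dir = query_dir / np.linalg.norm(query_dir)
    scores = view_dirs @ query_dir                   # cosine similarity per frame
    k = min(k, len(scores))
    top = np.argsort(-scores, kind="stable")[:k]     # highest similarity first
    return np.sort(top)                              # return in temporal order

def sparse_attention(q, keys, values, frame_ids, selected):
    """Softmax attention for one query, restricted to tokens of selected frames.

    q: (D,), keys: (T, D), values: (T, Dv), frame_ids: (T,) frame index per token.
    Assumes at least one token belongs to a selected frame.
    """
    mask = np.isin(frame_ids, selected)
    logits = keys[mask] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[mask]
```

With `F` past frames and `k` selected, the attention cost for each query scales with the tokens in `k` frames rather than all `F`, which is the efficiency argument the abstract makes. Whether consistency is preserved depends on how well the selection criterion captures revisited content, which this simple cosine rule does not guarantee.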

Problem

Research questions and friction points this paper is trying to address.

spatial persistence
geometric drift
3D consistency
video world model
camera control

Innovation

Methods, ideas, or system contributions that make the work stand out.

ViewRope
geometry-aware attention
3D consistency
video world model
camera-ray encoding