🤖 AI Summary
This work addresses three key challenges in dynamic video re-rendering: (1) difficulty in spatiotemporal alignment, (2) misapplication of RoPE (Rotary Position Embedding) for camera-conditioned modeling, and (3) poor generalization to variable-length videos. To this end, we propose Rotary Camera Encoding (RoCE), a novel camera-conditioned positional encoding mechanism. RoCE incorporates camera pose parameters into the phase shift of RoPE, enabling robust modeling of out-of-distribution camera trajectories and arbitrarily long videos. By explicitly encoding multi-view geometric relationships between the input and target videos, RoCE improves dynamic object localization accuracy and background consistency, and its integration into Transformer-based architectures yields spatiotemporally coherent generation. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches across diverse camera motions and video lengths in camera controllability, geometric consistency, and visual fidelity.
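The summary describes conditioning RoPE's phase on camera pose. The paper's exact formulation is not given here, so the following is only a minimal sketch of the general idea: standard RoPE rotates feature pairs by position-dependent angles, and a camera-derived phase offset (here a hypothetical `cam_phase` array, assumed to be produced elsewhere from camera extrinsics) is added to those angles before the rotation is applied.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: rotation angle per position and frequency pair."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    return np.outer(positions, freqs)               # (n, dim/2) angles

def apply_rotation(x, angles):
    """Rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def camera_conditioned_rope(x, positions, cam_phase):
    """Sketch of a camera-conditioned phase shift: cam_phase (n, dim/2)
    is a hypothetical per-token offset derived from camera pose; it is
    simply added to the standard RoPE angles before rotating."""
    angles = rope_angles(positions, x.shape[-1]) + cam_phase
    return apply_rotation(x, angles)
```

With `cam_phase` set to zero this reduces to plain RoPE, and because each feature pair undergoes a pure rotation, token norms are preserved regardless of the camera condition.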
📝 Abstract
We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.