🤖 AI Summary
High-resolution 360° omnidirectional video capture typically relies on costly, non-scalable multi-camera rigs. To address this, we propose a single-input video generation framework based on diffusion models. Our method introduces three key innovations: (1) a novel multi-view attention mechanism that explicitly enforces geometric consistency across spherical views; (2) joint conditioning on both text prompts and a single conventional video input, enabling semantically controllable multi-view synthesis; and (3) a random spatiotemporal sub-sampling scheme coupled with cross-scale autoregressive modeling to enhance temporal coherence and structural fidelity in long sequences. Evaluated on both synthetic and real-world datasets, our approach achieves state-of-the-art performance in image quality, motion consistency, and spherical geometry preservation—outperforming existing methods by significant margins.
📝 Abstract
High-resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly on two conditioning modes, text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model gracefully generalizes to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
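To make the random subsampling idea concrete, here is a minimal sketch of how one training step might draw a random subset of camera views and a random contiguous frame window. All names and parameters below are illustrative assumptions; the abstract only states that view count and clip duration are randomly subsampled during training.

```python
import random

def subsample_views_and_frames(total_views, total_frames,
                               k_views, k_frames, rng=None):
    """Illustrative sketch: pick a random subset of camera views and a
    random contiguous window of frames for one training step.

    The function name, signature, and layout are hypothetical; the paper
    only describes the high-level strategy of subsampling duration and
    camera views to reduce the cost of multi-view video training.
    """
    rng = rng or random.Random()
    # Sample k_views distinct view indices (e.g. views tiling the sphere).
    views = sorted(rng.sample(range(total_views), k_views))
    # Sample a contiguous temporal window of k_frames frames.
    start = rng.randrange(total_frames - k_frames + 1)
    frames = list(range(start, start + k_frames))
    return views, frames
```

At inference time, the model would then be queried with more views and frames than any single training step saw, relying on the generalization behavior the abstract describes.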