🤖 AI Summary
High-resolution 360° omnidirectional video capture typically relies on costly, non-scalable multi-camera rigs. To address this, we propose a single-input video generation framework based on diffusion models. Our method introduces three key innovations: (1) a novel multi-view attention mechanism that explicitly enforces geometric consistency across spherical views; (2) joint conditioning on both text prompts and a single conventional video input, enabling semantically controllable multi-view synthesis; and (3) a random spatiotemporal sub-sampling scheme coupled with cross-scale autoregressive modeling to enhance temporal coherence and structural fidelity in long sequences. Evaluated on both synthetic and real-world datasets, our approach achieves state-of-the-art performance in image quality, motion consistency, and spherical geometry preservation—outperforming existing methods by significant margins.
📝 Abstract
High-resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce VideoPanda, a novel approach for synthesizing 360$^\circ$ videos conditioned on text or single-view video data. VideoPanda leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. VideoPanda is trained jointly on two conditioning modes, text-only and single-view video, and supports autoregressive generation of long videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model gracefully generalizes to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that VideoPanda generates more realistic and coherent 360$^\circ$ panoramas across all input conditions compared to existing methods. Visit the project website at https://research-staging.nvidia.com/labs/toronto-ai/VideoPanda/ for results.
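To make the random subsampling idea concrete, here is a minimal sketch of how one training step might draw a random subset of camera views and a random contiguous frame window. All names and parameters below are illustrative assumptions; the abstract only states that view count and clip duration are randomly subsampled during training.

```python
import random

def subsample_views_and_frames(total_views, total_frames,
                               k_views, k_frames, rng=None):
    """Illustrative sketch: pick a random subset of camera views and a
    random contiguous window of frames for one training step.

    The function name, signature, and layout are hypothetical; the paper
    only describes the high-level strategy of subsampling duration and
    camera views to reduce the cost of multi-view video training.
    """
    rng = rng or random.Random()
    # Sample k_views distinct view indices (e.g. views tiling the sphere).
    views = sorted(rng.sample(range(total_views), k_views))
    # Sample a contiguous temporal window of k_frames frames.
    start = rng.randrange(total_frames - k_frames + 1)
    frames = list(range(start, start + k_frames))
    return views, frames
```

At inference time, the model would then be queried with more views and frames than any single training step saw, relying on the generalization behavior the abstract describes.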