🤖 AI Summary
Existing methods struggle to generate high-fidelity, fully view-consistent dynamic panoramic 4D scenes, and are typically limited to static content or narrow-field-of-view videos. To address this, we propose a dual-branch generative framework that jointly performs panoramic video synthesis and dynamic scene reconstruction. The generation stage enables fine-grained spatiotemporal control via bidirectional cross-attention between its two branches; the reconstruction stage leverages metric depth maps to guide the geometric alignment of 3D Gaussian splatting point clouds and jointly optimizes camera poses. To our knowledge, this is the first method to achieve geometrically consistent, motion-coherent, and view-invariant immersive panoramic 4D scene generation. Extensive experiments demonstrate significant improvements over state-of-the-art static and narrow-FOV approaches in visual realism, temporal consistency, and geometric stability. Our work establishes a new paradigm for constructing 360° dynamic virtual environments.
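The summary mentions using metric depth maps to align 3D Gaussian splatting point clouds. A standard building block for this is back-projecting a metric depth map into camera-space 3D points via the pinhole model; the sketch below illustrates that step only (the function name, toy intrinsics, and plane depth are illustrative assumptions, not the paper's actual pipeline):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth map (H, W) into camera-space points (H*W, 3).

    Uses the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Toy example (hypothetical intrinsics): a flat plane 2 m in front of the camera.
pts = depth_to_points(np.full((4, 4), 2.0), fx=100.0, fy=100.0, cx=2.0, cy=2.0)
# The principal-point pixel maps to (0, 0, 2): straight along the optical axis.
```

Points produced this way for each frame can then be compared or merged across time, which is the sense in which depth guides geometric alignment.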
📝 Abstract
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for high-quality, immersive dynamic scenes. However, existing generative methods predominantly focus on static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce **TiP4GEN**, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a **Dual-branch Generation Model** consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a **Geometry-aligned Reconstruction Model** based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at https://ke-xing.github.io/TiP4GEN/.
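The bidirectional cross-attention described above means each branch's tokens attend to the other branch's tokens, so panorama and perspective features are updated with each other's context. A minimal single-head sketch of that exchange, assuming hypothetical token counts and feature width (not the paper's actual layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Single-head cross-attention: `queries` attend to `context` tokens.

    Simplified: the tokens themselves serve as Q/K/V (no learned projections).
    """
    scores = queries @ context.T / np.sqrt(d)   # (Nq, Nc) scaled dot products
    return softmax(scores, axis=-1) @ context   # context-weighted summaries

rng = np.random.default_rng(0)
d = 8
pano = rng.standard_normal((16, d))   # panorama-branch tokens (toy sizes)
persp = rng.standard_normal((4, d))   # perspective-branch tokens

# Bidirectional exchange with residual connections: each branch is
# enriched with information aggregated from the other branch.
pano_updated = pano + cross_attention(pano, persp, d)
persp_updated = persp + cross_attention(persp, pano, d)
```

In a real model each branch would apply learned Q/K/V projections and multiple heads; the point here is only the two-way direction of the attention, which is what distinguishes this design from one branch conditioning on the other unilaterally.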