TiP4GEN: Text to Immersive Panorama 4D Scene Generation

📅 2025-08-17
🤖 AI Summary
Existing methods struggle to generate high-fidelity dynamic panoramic 4D scenes that stay consistent across the full view; they are typically limited to static content or narrow-field-of-view videos. TiP4GEN addresses this with a framework that integrates panoramic video generation and dynamic scene reconstruction. For video generation, a dual-branch model pairs a panorama branch (global view) with a perspective branch (local view) and exchanges information between them through bidirectional cross-attention. For reconstruction, a model based on 3D Gaussian Splatting aligns spatial-temporal point clouds using metric depth maps and initializes scene cameras with estimated poses. The authors report that experiments show the resulting scenes are visually compelling, motion-coherent, and geometrically consistent, making the approach suited to building 360° dynamic virtual environments.

📝 Abstract
With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce TiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at https://ke-xing.github.io/TiP4GEN/.
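As a rough illustration of how metric depth can place panorama pixels in 3D, the sketch below lifts an equirectangular depth map to a point cloud. It is not taken from the paper: the function name equirect_depth_to_points, the y-up axis convention, and the assumption that depth stores distance along each viewing ray are all illustrative choices.

```python
import numpy as np

def equirect_depth_to_points(depth):
    """Lift an (H, W) equirectangular depth map to an (H*W, 3) point cloud.

    Assumption (not from the paper): depth[v, u] is the metric distance
    along the viewing ray of pixel (u, v), and the y axis points up.
    """
    h, w = depth.shape
    # Pixel centers mapped to longitude in [-pi, pi) and latitude in (-pi/2, pi/2).
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (h, w)
    # Unit viewing direction for every pixel on the sphere.
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    # Scale each unit direction by its depth value.
    return (depth[..., None] * dirs).reshape(-1, 3)

# Toy input: a 4x8 panorama where every ray hits a surface 2 units away.
points = equirect_depth_to_points(np.full((4, 8), 2.0))
```

Because the depth is metric, point clouds lifted from different frames share one scale, which is what makes aligning them across time meaningful.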
Problem

Research questions and friction points this paper is trying to address.

Generates immersive 360-degree dynamic scenes from text
Ensures geometry consistency in panoramic 4D scenes
Combines panorama video generation with dynamic reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch Generation Model for panorama and perspective views
Geometry-aligned Reconstruction Model using 3D Gaussian Splatting
Bidirectional cross-attention for comprehensive information exchange
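The bidirectional cross-attention idea can be sketched as two attention passes, one in each direction, so that each branch reads from the other. This is a minimal NumPy sketch under simplifying assumptions and not the paper's implementation: it uses a single head, no learned projections, and a plain residual update, and the names cross_attend and bidirectional_cross_attention are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def bidirectional_cross_attention(pano_tokens, persp_tokens):
    """Let each branch attend to the other and add the result residually."""
    pano_out = pano_tokens + cross_attend(pano_tokens, persp_tokens)
    persp_out = persp_tokens + cross_attend(persp_tokens, pano_tokens)
    return pano_out, persp_out

# Toy tokens: 8 panorama tokens and 4 perspective tokens, 16 channels each.
rng = np.random.default_rng(0)
pano = rng.normal(size=(8, 16))
persp = rng.normal(size=(4, 16))
pano_out, persp_out = bidirectional_cross_attention(pano, persp)
```

Each output keeps the shape of its own branch, so the exchange can be dropped between existing layers without changing token counts.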
Ke Xing
Institute of Information Science, Beijing Jiaotong University, Visual Intelligence + X International Joint Laboratory, Beijing, China
Hanwen Liang
University of Toronto
Dejia Xu
University of Texas at Austin
computer vision
Yuyang Yin
Beijing Jiaotong University
Computer Vision, AIGC
Konstantinos N. Plataniotis
University of Toronto, Toronto, Canada
Yao Zhao
Institute of Information Science, Beijing Jiaotong University, Visual Intelligence + X International Joint Laboratory, Beijing, China
Yunchao Wei
Professor, Beijing Jiaotong University, UTS, UIUC, NUS
Computer Vision, Machine Learning