🤖 AI Summary
This work addresses the challenge of achieving both high visual fidelity and structural consistency in text-driven immersive 3D scene generation, where existing approaches suffer from context drift during autoregressive expansion or from limited resolution in panoramic video synthesis. To overcome these limitations, we propose Stepper, a stepwise panoramic scene expansion framework that couples a multi-view 360° diffusion model with a geometry reconstruction pipeline, enabling the generation of semantically consistent, high-resolution, and explorable 3D environments. Our key contributions include a stepwise expansion mechanism, a multi-view 360° diffusion model paired with a reconstruction pipeline that enforces geometric coherence, and a new large-scale multi-view panorama dataset. Experiments show that our method outperforms prior approaches in both visual fidelity and structural consistency, setting a new standard for immersive scene generation.
📝 Abstract
The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches face a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches and setting a new standard for immersive scene generation.