Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

📅 2026-03-30
🤖 AI Summary
This work addresses the difficulty of achieving both high visual fidelity and explorability in text-driven immersive 3D scene generation: autoregressive expansion suffers from context drift, while panoramic video synthesis is limited to low resolution. The authors propose Stepper, a stepwise panoramic scene expansion framework that couples a multi-view 360° diffusion model with a geometry reconstruction pipeline to generate consistent, high-resolution, explorable 3D environments. The main contributions are the stepwise expansion mechanism, the multi-view 360° diffusion model paired with geometry reconstruction for geometric coherence, and a new large-scale multi-view panorama dataset on which the model is trained. The authors report state-of-the-art fidelity and structural consistency compared with prior approaches.
📝 Abstract
The synthesis of immersive 3D scenes from text is rapidly maturing, driven by novel video generative models and feed-forward 3D reconstruction, with vast potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches suffer from a trade-off between visual fidelity and explorability: autoregressive expansion suffers from context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper leverages a novel multi-view 360° diffusion model that enables consistent, high-resolution expansion, coupled with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches, thereby setting a new standard for immersive scene generation.
Problem

Research questions and friction points this paper is trying to address.

immersive 3D scene generation
visual fidelity
explorability
context drift
panoramic video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

stepwise panoramic expansion
multi-view 360° diffusion model
geometry reconstruction
text-to-3D scene generation
immersive scene synthesis