🤖 AI Summary
Existing text-to-3D scene generation methods struggle to deliver both scene completeness and navigability. This paper proposes the first end-to-end framework for generating fully traversable 3D scenes from text. It uses a 128-frame 360° panoramic video as an intermediate representation and introduces the first conditional diffusion model for controllable 360° video generation, jointly optimized with feedforward 3D Gaussian splatting for photorealistic geometry and texture reconstruction. To ensure coherence, it incorporates multi-frame view consistency constraints and a text–video–3D cross-modal alignment mechanism. The method surpasses state-of-the-art approaches in scene completeness (free navigation over more than 10 m²), view consistency, and reconstruction fidelity: panoramic video FID improves by 23%, and 3D reconstruction PSNR increases by 5.1 dB.
📝 Abstract
Scene-level 3D generation is a challenging research topic: most existing methods generate only partial scenes and offer limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a truly walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show that WorldPrompter outperforms state-of-the-art 360° video generators and 3D scene generation models.
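To make the idea of a 360° panoramic frame more concrete, the sketch below shows how a pixel in an equirectangular image can be mapped to a viewing direction on the unit sphere. This is only an illustration under stated assumptions: the abstract does not say which panoramic projection or axis convention WorldPrompter uses, and the function name `equirect_pixel_to_ray` and the coordinate convention are hypothetical.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map pixel (u, v) of an equirectangular frame to a unit ray direction.

    Assumed convention (hypothetical, not taken from the paper):
    longitude runs from -pi to pi, left to right; latitude runs from
    +pi/2 to -pi/2, top to bottom; +z is forward, +x is right, +y is up.
    Pixel centres sit at (u + 0.5, v + 0.5).
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


# The point at the exact image centre looks straight ahead along +z.
centre_ray = equirect_pixel_to_ray(1.5, 0.5, 4, 2)
```

Because every pixel corresponds to some direction on the full sphere, a single frame of this kind covers the whole surroundings of the camera, which is what makes a panoramic video a plausible intermediate representation for reconstructing a walkable scene.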