🤖 AI Summary
Existing text-to-3D scene generation methods suffer from ambiguous text descriptions and coarse-grained spatial control, resulting in implausible layouts and physically inconsistent scenes. To address this, we propose a two-stage diffusion-based framework conditioned on 3D semantic layouts: (1) a semantic-guided geometry diffusion model refines scene structure in the first stage; (2) a semantic-geometry guided diffusion model synthesizes high-fidelity surface appearance in the second stage. Our approach decouples object and background representations via a hybrid scene representation initialized from a pretrained text-to-3D model, leverages 2D diffusion priors in both stages, and enables fine-grained, object-level editing. Quantitative and qualitative evaluations demonstrate state-of-the-art performance across multiple metrics, particularly in physical plausibility and semantic alignment, while supporting practical applications in virtual reality and autonomous driving simulation.
📝 Abstract
3D scene generation conditioned on text prompts has progressed significantly with the development of 2D diffusion generation models. However, text descriptions of 3D scenes are inherently ambiguous and lack fine-grained spatial control during training, leading to implausible scene generation. As an intuitive and feasible alternative, a 3D layout allows precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) that uses an additional semantic layout as the prompt to inject precise control over 3D object positions. Specifically, we first introduce a hybrid scene representation that decouples objects and backgrounds, initialized via a pre-trained text-to-3D model. We then propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors for geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model, both fine-tuned on a scene dataset. Extensive experiments demonstrate that our method generates more plausible and realistic scenes than state-of-the-art approaches. Furthermore, the generated scenes allow flexible yet precise editing, facilitating multiple downstream applications.
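The two-stage scheme described in the abstract can be sketched at a very high level as below. This is an illustrative outline only, not the authors' released code: every name (`optimize_scene`, the `scene` dictionary, the stage counts) is a hypothetical placeholder, and the diffusion-guided updates, which in the paper come from the semantic-guided geometry model and the semantic-geometry guided appearance model, are stubbed out as fixed steps.

```python
def optimize_scene(scene, layout, n_geo_steps=3, n_app_steps=3):
    """Hypothetical sketch of a two-stage scene optimization.

    Stage 1 refines geometry under semantic-layout guidance;
    Stage 2 refines appearance under joint semantic-geometry guidance.
    The real update signals would come from fine-tuned diffusion models;
    here they are replaced by placeholder gradient steps.
    """
    log = []
    # Stage 1: geometry optimization (semantic-guided geometry diffusion)
    for i in range(n_geo_steps):
        scene["geometry"] += 0.1  # placeholder for a diffusion-guided update
        log.append(("geometry", i))
    # Stage 2: appearance optimization (semantic-geometry guided diffusion)
    for i in range(n_app_steps):
        scene["appearance"] += 0.1  # placeholder for a diffusion-guided update
        log.append(("appearance", i))
    return scene, log
```

The key design choice reflected here is the strict separation of stages: geometry is fixed before appearance is optimized, so the appearance model can condition on the refined geometry rather than co-adapting to a moving target.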