🤖 AI Summary
Existing text-to-3D approaches struggle to generate large-scale, geometrically consistent indoor scenes that faithfully adhere to both user-provided textual descriptions and spatial layout preferences, largely because they rely on single-room assumptions and offer limited control over shape and texture. This work introduces the first rendering-guided paradigm for converting 3D semantic layouts into multi-view proxy images, enabling text- and layout-driven, end-to-end generation of apartment-scale, multi-room 3D scenes. The method combines three stages: rendering the 3D semantic layout into per-view maps, generating multi-view images with a semantic- and depth-conditioned diffusion model, and optimizing a NeRF from those images. By preserving geometric consistency throughout the pipeline, it improves texture diversity and visual realism. Notably, it is the first method to support high-fidelity, end-to-end synthesis of irregularly structured, multi-bedroom apartments, overcoming longstanding limitations in controllability, scalability, and scene complexity.
📝 Abstract
The creation of complex 3D scenes tailored to user specifications has been a tedious and challenging task with traditional 3D modeling tools. Although some pioneering methods have achieved automatic text-to-3D generation, they are generally limited to small-scale scenes with restricted control over shape and texture. We introduce SceneCraft, a novel method for generating detailed indoor scenes that adhere to textual descriptions and spatial layout preferences provided by users. Central to our method is a rendering-based technique, which converts 3D semantic layouts into multi-view 2D proxy maps. Furthermore, we design a semantic- and depth-conditioned diffusion model to generate multi-view images, which are used to learn a neural radiance field (NeRF) as the final scene representation. Free from the constraints of panoramic image generation, we surpass previous methods in supporting complicated indoor spaces beyond a single room, up to a whole multi-bedroom apartment with irregular shapes and layouts. Through experimental analysis, we demonstrate that our method significantly outperforms existing approaches in complex indoor scene generation with diverse textures, consistent geometry, and realistic visual quality. Code and more results are available at: https://orangesodahub.github.io/SceneCraft
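To make the three-stage pipeline concrete, here is a minimal Python sketch of the control flow: layout rasterization into per-view semantic/depth proxy maps, conditioned diffusion sampling per view, and NeRF fitting from the generated images. Every name in it (`render_layout_maps`, `diffusion_sample`, `fit_nerf`, and the placeholder types) is a hypothetical illustration of the idea described above, not the actual SceneCraft API.

```python
# Conceptual sketch of the pipeline; all names are hypothetical placeholders.
from dataclasses import dataclass
from typing import Any, List

Layout = Any  # user-provided 3D semantic layout (e.g., labeled room/box geometry)
Camera = Any  # camera pose for one rendered view
Image = Any   # 2D image / map tensor


@dataclass
class ProxyMaps:
    semantic: Image  # per-pixel semantic labels rendered from the layout
    depth: Image     # per-pixel depth rendered from the layout


def render_layout_maps(layout: Layout, camera: Camera) -> ProxyMaps:
    """Stage 1 (stub): rasterize the 3D semantic layout into 2D proxy maps."""
    raise NotImplementedError


def diffusion_sample(prompt: str, proxies: ProxyMaps) -> Image:
    """Stage 2 (stub): sample one view from a semantic- and
    depth-conditioned diffusion model."""
    raise NotImplementedError


def fit_nerf(images: List[Image], cameras: List[Camera],
             depth_priors: List[Image]) -> Any:
    """Stage 3 (stub): optimize a NeRF against the generated multi-view images."""
    raise NotImplementedError


def generate_scene(layout: Layout, text_prompt: str,
                   cameras: List[Camera]) -> Any:
    # 1) Render the layout into semantic/depth proxy maps for each camera pose.
    proxies = [render_layout_maps(layout, cam) for cam in cameras]
    # 2) Paint each view with the conditioned diffusion model, so generated
    #    images follow both the text prompt and the layout geometry.
    images = [diffusion_sample(text_prompt, p) for p in proxies]
    # 3) Distill the multi-view images (with depth priors from the proxies)
    #    into a NeRF, the final consistent 3D scene representation.
    return fit_nerf(images, cameras, [p.depth for p in proxies])
```

Because the diffusion model sees the same rendered geometry that later supervises the NeRF, the per-view images stay mutually consistent, which is what lets the approach scale past a single room under this design.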