AI Summary
Existing 3D indoor scene generation methods struggle to jointly model visual features and user-defined stylistic preferences. To address this, we propose the first fully language-based representation framework, which unifies scene layout, furniture selection, and style control as structured linguistic instructions within a shared text space, enabling "text-to-text" semantic mapping via large language models (LLMs). We further introduce a novel text-driven furniture retrieval paradigm grounded in multimodal large language models (MLLMs), integrating text-to-layout generation with semantically aligned cross-modal retrieval. This enables fine-grained, joint control over stylistic attributes and spatial relationships. Evaluated on the 3D-FRONT benchmark, our method significantly improves text-conditioned scene synthesis quality and furniture retrieval accuracy. It is the first approach to achieve end-to-end, style-controllable, semantically precise, and layout-coherent 3D indoor scene generation.
Abstract
3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should not only produce plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, allowing only text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method, Decorum, enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This allows us to harness recent advances in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the 3D-FRONT benchmark dataset show that our methods improve over existing work in both text-conditioned scene synthesis and object retrieval.