AI Summary
Existing 3D indoor scene generation methods struggle to jointly model visual features and user-defined stylistic preferences. To address this, we propose the first fully language-based representation framework, which unifies scene layout, furniture selection, and style control as structured linguistic instructions within a shared text space, enabling "text-to-text" semantic mapping via large language models (LLMs). We further introduce a novel text-driven furniture retrieval paradigm grounded in multimodal large language models (MLLMs), integrating text-to-layout generation with semantically aligned cross-modal retrieval. This enables fine-grained, joint control over stylistic attributes and spatial relationships. Evaluated on the 3D-FRONT benchmark, our method significantly improves text-conditioned scene synthesis quality and furniture retrieval accuracy. It is the first approach to achieve end-to-end, style-controllable, semantically precise, and layout-coherent 3D indoor scene generation.
Abstract
3D indoor scene generation is an important problem for the design of digital and real-world environments. To automate this process, a scene generation model should not only produce plausible scene layouts, but also take into consideration visual features and style preferences. Existing methods for this task exhibit very limited control over these attributes, allowing only text inputs in the form of simple object-level descriptions or pairwise spatial relationships. Our proposed method, Decorum, enables users to control the scene generation process with natural language by adopting language-based representations at each stage. This allows us to harness recent advances in Large Language Models (LLMs) to model language-to-language mappings. In addition, we show that a text-based representation allows us to select furniture for our scenes using a novel object retrieval method based on multimodal LLMs. Evaluations on the 3D-FRONT benchmark dataset show that our methods improve over existing work in both text-conditioned scene synthesis and object retrieval.