🤖 AI Summary
Existing natural language–driven 2D/3D layout generation methods rely on implicit modeling of object joint distributions and relational structures, resulting in poor controllability and low fidelity. To address this, we propose a semantic graph prior mechanism that explicitly decouples appearance representations from spatial distributions, enabling an instruction-conditioned layout decoder. We further use large language models (LLMs) and multimodal foundation models to automatically construct a high-quality benchmark of instruction–layout pairs, the first publicly available dataset of its kind. Our framework supports zero-shot generalization across tasks and dimensions (2D ↔ 3D). Experiments demonstrate significant improvements over state-of-the-art methods across multiple layout synthesis benchmarks. Ablation studies confirm the critical roles of the semantic graph prior and the co-designed data construction pipeline. Overall, our approach achieves highly controllable, high-fidelity, and dimensionally unified layout generation.
📝 Abstract
Comprehending natural language instructions is a desirable property for both 2D and 3D layout synthesis systems. Existing methods model object joint distributions and express object relations only implicitly, which hinders the controllability of generation. We introduce InstructLayout, a novel generative framework that integrates a semantic graph prior and a layout decoder to improve controllability and fidelity for 2D and 3D layout synthesis. The proposed semantic graph prior learns layout appearances and object distributions simultaneously, demonstrating versatility across various downstream tasks in a zero-shot manner. To facilitate benchmarking of text-driven 2D and 3D scene synthesis, we curate two high-quality datasets of layout-instruction pairs, one for each setting, from public Internet resources with the help of large language and multimodal models. Extensive experimental results reveal that the proposed method outperforms existing state-of-the-art approaches by a large margin in both 2D and 3D layout synthesis tasks. Thorough ablation studies confirm the efficacy of crucial design components.