🤖 AI Summary
Current large language models often produce erroneous layouts and object collisions in 3D indoor scene generation due to inadequate spatial representations. To address this, this work proposes SpatialGrammar, a compilable domain-specific language (DSL) tailored for 3D indoor scenes that encodes spatial structure through a gravity-aligned top-down grid representation and deterministically compiles into collision-free 3D geometry. Leveraging this DSL, we develop SG-Agent, a closed-loop optimization system, and SG-Mini, a lightweight 104M-parameter model. Together they enable, for the first time, efficient scene generation with a model trained exclusively on synthetic data. Experiments demonstrate that SG-Agent substantially improves spatial fidelity and physical plausibility, while SG-Mini matches or exceeds the performance of significantly larger LLM baselines in single-pass generation.
📝 Abstract
Automatically generating interactive 3D indoor scenes from natural language is crucial for virtual reality, gaming, and embodied AI. However, existing LLM-based approaches often suffer from spatial errors and collisions, in part because common scene representations (raw coordinates or verbose code) make it difficult for models to reason about 3D spatial relationships and physical constraints. We propose SpatialGrammar, a domain-specific language that represents gravity-aligned indoor layouts as bird's-eye-view (BEV) grid placements with deterministic compilation to valid 3D geometry, enabling verifiable constraint checking. Building on this representation, we develop (1) SG-Agent, a closed-loop system that uses compiler feedback to iteratively refine scenes and enforce collision constraints, and (2) SG-Mini, a 104M-parameter model trained entirely on compiler-validated synthetic data. Across 159 test scenes spanning five scenarios of different complexity, SG-Agent improves spatial fidelity and physical plausibility over prior methods, while SG-Mini performs competitively against larger LLM-based baselines on single-shot generation scenarios.
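To make the idea of grid placements, deterministic compilation, and compiler feedback concrete, the following is a minimal Python sketch. It is not the SpatialGrammar DSL or the SG-Agent implementation: the names `Placement`, `Box`, `compile_scene`, and `refine`, the 0.5 m cell size, and the error format are all hypothetical, and a fixed "nudge one cell in +x" rule stands in for the LLM revision step described above.

```python
from dataclasses import dataclass, replace
from itertools import combinations

CELL = 0.5  # assumed grid cell size in metres (illustrative only)


@dataclass(frozen=True)
class Placement:
    """A floor-standing object described on a top-down (BEV) grid."""
    name: str
    col: int       # left-most cell along x
    row: int       # first cell along y
    width: int     # footprint size in cells along x
    depth: int     # footprint size in cells along y
    height: float  # vertical extent in metres


@dataclass(frozen=True)
class Box:
    """Axis-aligned 3D box produced by compilation."""
    name: str
    min_corner: tuple
    max_corner: tuple


def footprints_overlap(a, b):
    # Both objects rest on the floor, so overlapping footprints imply a 3D collision.
    return (a.col < b.col + b.width and b.col < a.col + a.width
            and a.row < b.row + b.depth and b.row < a.row + a.depth)


def compile_scene(placements, cols, rows):
    """Deterministically map grid placements to 3D boxes and list constraint errors."""
    errors = []
    for p in placements:
        inside = (p.col >= 0 and p.row >= 0
                  and p.col + p.width <= cols and p.row + p.depth <= rows)
        if not inside:
            errors.append(("out_of_bounds", p.name))
    for a, b in combinations(placements, 2):
        if footprints_overlap(a, b):
            errors.append(("collision", a.name, b.name))
    boxes = [
        Box(p.name,
            (p.col * CELL, p.row * CELL, 0.0),
            ((p.col + p.width) * CELL, (p.row + p.depth) * CELL, p.height))
        for p in placements
    ]
    return boxes, errors


def refine(placements, cols, rows, max_rounds=10):
    """Toy closed loop: compile, read the errors, revise, repeat.

    A fixed rule (shift the second colliding object one cell in +x)
    stands in for the model's revision step.
    """
    current = list(placements)
    for _ in range(max_rounds):
        _, errors = compile_scene(current, cols, rows)
        collisions = [e for e in errors if e[0] == "collision"]
        if not collisions:
            break
        mover = collisions[0][2]
        current = [replace(p, col=p.col + 1) if p.name == mover else p
                   for p in current]
    return current


scene = [Placement("bed", 0, 0, 4, 4, 0.6), Placement("desk", 3, 1, 2, 1, 0.75)]
print(compile_scene(scene, 8, 8)[1])   # errors reported for the initial layout
print(refine(scene, 8, 8))             # layout after the toy feedback loop
```

Because placements are integer grid cells rather than free-form coordinates, the overlap and bounds checks reduce to integer comparisons, which is what makes the constraint check verifiable in this sketch.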