🤖 AI Summary
This work addresses the challenge of efficiently generating semantically coherent and spatially accurate complete 3D indoor scenes from natural language instructions. To this end, the authors propose SceneNAT, a single-stage, non-autoregressive Transformer that synthesizes scenes directly from text using a few parallel decoding passes over a fully discretized semantic and spatial representation. Key innovations include a dual masking strategy that operates at both the attribute level and the instance level, and a dedicated triplet predictor that maps a set of learnable relation queries to a sparse set of symbolic (subject, predicate, object) triplets to strengthen inter-object relationship modeling. Experiments on the 3D-FRONT dataset show that the method outperforms existing autoregressive and diffusion-based approaches in both semantic compliance and spatial arrangement accuracy, at substantially lower computational cost.
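The dual masking idea can be illustrated with a minimal sketch. It assumes a scene is stored as a list of objects, each a list of discrete attribute tokens. The `MASK` id, the `dual_mask` helper, and the masking rates are hypothetical names and values chosen for illustration; the source does not specify how masks are sampled.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)

def dual_mask(scene, attr_rate, inst_rate, rng):
    """Mask a scene given as a list of objects, each a list of discrete tokens.

    Instance-level masking hides every attribute of a chosen object;
    attribute-level masking hides individual tokens of the remaining objects.
    """
    masked = []
    for obj in scene:
        if rng.random() < inst_rate:
            # Instance level: the whole object must be inferred from the others.
            masked.append([MASK] * len(obj))
        else:
            # Attribute level: each token is hidden independently.
            masked.append([MASK if rng.random() < attr_rate else tok for tok in obj])
    return masked

# Toy scene: 3 objects x 4 attribute tokens
# (for example class, position bin, size bin, orientation bin).
scene = [[5, 12, 3, 0], [7, 40, 9, 2], [1, 22, 4, 1]]
masked_scene = dual_mask(scene, attr_rate=0.3, inst_rate=0.3, rng=random.Random(0))
```

In this sketch, instance-level masks push a model to reason about inter-object structure, while attribute-level masks exercise intra-object consistency.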
📝 Abstract
We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions in only a few parallel decoding passes, improving on prior state-of-the-art approaches in both performance and efficiency. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. Masking is applied at both the attribute level and the instance level, so that the model better captures intra-object and inter-object structure. To strengthen relational reasoning, SceneNAT employs a dedicated triplet predictor that models the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset show that SceneNAT outperforms state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, at substantially lower computational cost.
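Decoding in a few parallel passes can be sketched as follows. This is not the authors' procedure: the abstract does not state the unmasking schedule, so the sketch borrows a confidence-ranked cosine schedule in the style of MaskGIT as an assumption. `toy_predictor` is a hypothetical random stand-in for the Transformer, and `parallel_decode`, `MASK`, and `vocab_size` are illustrative names.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)

def toy_predictor(tokens, rng, vocab_size=16):
    """Stand-in for the Transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

def parallel_decode(length, num_passes, predictor, rng):
    """Fill `length` masked tokens in at most `num_passes` parallel passes."""
    tokens = [MASK] * length
    for step in range(1, num_passes + 1):
        masked_positions = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_positions:
            break
        guesses = predictor(tokens, rng)  # every position is predicted at once
        # Cosine schedule: how many tokens may stay masked after this pass.
        keep_masked = int(length * math.cos(math.pi / 2 * step / num_passes))
        num_to_commit = max(1, len(masked_positions) - keep_masked)
        # Commit the most confident guesses; the rest are retried next pass.
        ranked = sorted(masked_positions, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:num_to_commit]:
            tokens[i] = guesses[i][0]
    return tokens

decoded = parallel_decode(length=12, num_passes=4, predictor=toy_predictor,
                          rng=random.Random(0))
```

Unlike autoregressive decoding, which needs one forward pass per token, the number of passes here is fixed by `num_passes` regardless of sequence length, which is where the efficiency gain of a non-autoregressive model would come from under these assumptions.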