SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

📅 2026-01-12

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of efficiently generating semantically coherent and spatially accurate full 3D indoor scenes from natural language instructions. To this end, the authors propose a single-stage, non-autoregressive Transformer model that directly synthesizes scenes from text through parallel decoding and a fully discretized semantic-spatial representation. Key innovations include a dual masking strategy operating at both attribute and instance levels, as well as a learnable mapping mechanism that translates relational queries into symbolic triplets, significantly enhancing inter-object relationship modeling. Experiments on the 3D-FRONT dataset demonstrate that the proposed method outperforms existing autoregressive and diffusion-based approaches in both semantic plausibility and spatial layout accuracy, while substantially reducing computational overhead.
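To make the idea of parallel decoding concrete, the following is a minimal sketch of iterative mask-and-predict decoding over a flat sequence of discrete tokens. It is not the authors' implementation: the confidence-ranked commit rule and the cosine schedule are borrowed from common masked generative decoders and are assumptions here, and `MASK`, `parallel_decode`, and `toy_predict` are hypothetical names. The toy predictor stands in for the text-conditioned Transformer.

```python
import math

MASK = -1  # hypothetical id for the [MASK] token


def parallel_decode(tokens, predict, num_steps):
    """Fill every MASK position in a fixed number of parallel passes.

    `predict(tokens)` must return one (token, confidence) pair per position.
    Each pass commits the most confident predictions and leaves the rest
    masked for the next pass (assumed cosine schedule).
    """
    tokens = list(tokens)
    total = sum(1 for t in tokens if t == MASK)
    for step in range(1, num_steps + 1):
        masked_idx = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked_idx:
            break
        preds = predict(tokens)  # one forward pass predicts all positions
        # Number of tokens allowed to stay masked after this pass.
        keep_masked = int(total * math.cos(math.pi / 2 * step / num_steps))
        n_commit = max(1, len(masked_idx) - keep_masked)
        ranked = sorted(masked_idx, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in model: predicts 10 * position, more confident at early positions."""
    return [(10 * i, 1.0 / (1 + i)) for i in range(len(tokens))]


decoded = parallel_decode([7, MASK, MASK, MASK, MASK], toy_predict, num_steps=3)
```

Tokens that are already known (here the leading `7`) are never overwritten, and the number of model calls is bounded by `num_steps` rather than by the sequence length, which is the source of the efficiency gain over autoregressive decoding.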

📝 Abstract

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
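The two masking levels described above can be illustrated with a short sketch. It assumes a scene is a list of objects, each a list of discrete attribute token ids (for example category, position bins, and an orientation bin); that layout, the independent Bernoulli masking, and the names `MASK` and `dual_mask` are assumptions for illustration, not details taken from the paper.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token


def dual_mask(scene, attr_rate, inst_rate, rng):
    """Return a masked copy of `scene` (list of objects, each a list of token ids).

    Instance level: with probability `inst_rate`, hide a whole object.
    Attribute level: otherwise hide each of its tokens with probability `attr_rate`.
    """
    masked = []
    for obj in scene:
        if rng.random() < inst_rate:
            # Instance-level masking: the model must infer the object
            # from the other objects in the scene (inter-object structure).
            masked.append([MASK] * len(obj))
        else:
            # Attribute-level masking: the model must infer missing attributes
            # from the object's remaining ones (intra-object structure).
            masked.append([MASK if rng.random() < attr_rate else tok for tok in obj])
    return masked


# Three toy objects with four attribute tokens each.
scene = [[3, 12, 7, 40], [5, 30, 7, 2], [9, 1, 18, 25]]
masked = dual_mask(scene, attr_rate=0.3, inst_rate=0.3, rng=random.Random(0))
```

During training, the model would receive `masked` together with the text instruction and be asked to recover the original tokens at the masked positions.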

Problem

Research questions and friction points this paper is trying to address.

language-guided scene synthesis

3D indoor scene generation

semantic compliance

spatial arrangement

natural language instructions

Innovation

Methods, ideas, or system contributions that make the work stand out.

masked generative modeling

non-autoregressive Transformer

language-guided scene synthesis