🤖 AI Summary
Existing indoor 3D scene layout generation methods focus predominantly on large furniture and neglect small objects, leading to sparse, geometrically distorted scenes that fail to satisfy the dense spatial arrangements specified in text descriptions. To address this, we propose HSM, a hierarchical framework and the first to explicitly model cross-scale spatial dependencies between surfaces and objects, as well as recurring layout patterns across scales, enabling coherent multi-scale generation from floor-level furniture down to tabletop small objects. Our approach integrates a hierarchical graph neural network with a conditional diffusion model to jointly encode surface semantics, geometric constraints, and multi-granularity layout priors. Evaluated across diverse room types and layout configurations, our method achieves significant improvements in visual plausibility and text-layout alignment over state-of-the-art approaches.
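The pipeline the summary describes (a hierarchical graph over surfaces and objects whose encoding conditions a diffusion model over object poses) can be sketched compactly. The PyTorch sketch below is a hypothetical illustration, not the paper's implementation: the module names (`HierarchicalGNN`, `LayoutDenoiser`), the pose parameterization, the dimensions, and the cosine noise schedule are all assumptions made for exposition.

```python
import torch
import torch.nn as nn

class HierarchicalGNN(nn.Module):
    """Message passing over a support hierarchy: edges point from a
    supporting surface (floor, tabletop, shelf) to the nodes it carries."""
    def __init__(self, node_dim: int, hidden: int = 128, layers: int = 3):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden)
        self.msg = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
            for _ in range(layers)
        ])

    def forward(self, x, edges):
        # x: (N, node_dim) node features; edges: (E, 2) parent->child pairs
        h = self.embed(x)
        for layer in self.msg:
            src, dst = edges[:, 0], edges[:, 1]
            m = layer(torch.cat([h[src], h[dst]], dim=-1))
            agg = torch.zeros_like(h).index_add_(0, dst, m)
            h = h + agg  # residual update preserves per-scale identity
        return h

class LayoutDenoiser(nn.Module):
    """Predicts the noise added to object poses, conditioned on graph context."""
    def __init__(self, pose_dim: int = 7, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + hidden + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, cond, t):
        t = t.float().unsqueeze(-1) / 1000.0  # normalized timestep
        return self.net(torch.cat([noisy_pose, cond, t], dim=-1))

# One DDPM-style training step on random stand-in data (not from the paper).
N, E, node_dim, pose_dim = 8, 7, 16, 7
gnn, denoiser = HierarchicalGNN(node_dim), LayoutDenoiser(pose_dim)
x = torch.randn(N, node_dim)                          # surface/object semantics
edges = torch.stack([torch.zeros(E, dtype=torch.long),
                     torch.arange(1, E + 1)], dim=1)  # floor supports all others
pose = torch.randn(N, pose_dim)                       # position + size + yaw
t = torch.randint(0, 1000, (N,))
alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).unsqueeze(-1) ** 2
noise = torch.randn_like(pose)
noisy = alpha_bar.sqrt() * pose + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(denoiser(noisy, gnn(x, edges), t), noise)
loss.backward()
```

Directing edges from supporting surface to supported object lets information flow down the hierarchy, so tabletop placements can depend on the table's pose and semantics; this directionality is an assumption consistent with, but not confirmed by, the summary.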
📝 Abstract
Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods primarily focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Methods that do place small objects typically ignore arrangement specifications, scattering them largely at random rather than following the text description. We present HSM, a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic indoor scenes in a unified manner. Our experiments show that HSM outperforms existing methods, generating scenes that are more realistic and that better conform to user input across room types and spatial configurations.
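To make the support hierarchy concrete, here is a minimal, self-contained Python sketch of the kind of representation the abstract describes: a tree in which each surface node carries the objects placed on it, and a top-down traversal mirrors coarse-to-fine generation. The `SceneNode` type, its `place` helper, and the labels are illustrative assumptions, not HSM's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """One element of the support hierarchy: a surface and what it carries."""
    label: str                   # e.g. "floor", "dining_table", "plate"
    scale: int                   # 0 = room, 1 = furniture, 2 = tabletop, ...
    children: list["SceneNode"] = field(default_factory=list)

    def place(self, label: str) -> "SceneNode":
        # Hypothetical helper: attach an object one scale below this surface.
        child = SceneNode(label, self.scale + 1)
        self.children.append(child)
        return child

def walk(node: SceneNode, indent: int = 0) -> None:
    """Top-down traversal mirrors coarse-to-fine generation: each surface
    is populated before the surfaces it supports are recursed into."""
    print("  " * indent + f"[scale {node.scale}] {node.label}")
    for child in node.children:
        walk(child, indent + 1)

floor = SceneNode("floor", 0)
table = floor.place("dining_table")
table.place("plate")
table.place("vase")
shelf = floor.place("bookshelf")
shelf.place("book_stack")
walk(floor)
```

The same traversal applies at every level, which is one way recurring cross-scale layout patterns (a cluttered tabletop, a stocked shelf) can be reused within a unified generation loop.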