CasLayout: Cascaded 3D Layout Diffusion for Indoor Scene Synthesis with Implicit Relation Modeling

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles in 3D indoor scene synthesis: data scarcity, and the difficulty of satisfying global architectural constraints and local semantic consistency at the same time. The authors propose CasLayout, a cascaded diffusion framework that decomposes generation into four conditional sub-stages: (1) furniture count and category prediction, (2) object size and feature-embedding refinement, (3) spatial relationship modeling in a latent space, and (4) oriented bounding box (OBB) generation. Building elements such as walls, doors, and windows are modeled explicitly as conditional constraints. In place of a fully connected relation graph, the method uses a sparse relation graph aligned with human spatial descriptions and encodes it with a bidirectional VAE, which makes the modeled relationships controllable. The decoupled stages also allow large language models and vision-language models to be integrated for zero-shot tasks such as image-to-scene generation. The authors report state-of-the-art fidelity and diversity, along with improved controllability.
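The four-stage cascade described above can be pictured as a chain of conditional samplers, each reading what the earlier stages produced. The following is only a minimal sketch of that control flow, not the authors' implementation: every name (`SceneState`, `stage1_categories`, `generate_scene`, the `building` dictionary format) is hypothetical, and each stage is a random stub standing in for a learned diffusion model.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class SceneState:
    """Accumulates the outputs of each stage (hypothetical container)."""
    building: dict                                   # room condition; format is an assumption
    categories: list = field(default_factory=list)   # stage 1 output
    sizes: list = field(default_factory=list)        # stage 2 output
    relation_latent: list = field(default_factory=list)  # stage 3 output
    obbs: list = field(default_factory=list)         # stage 4 output


def stage1_categories(state, rng):
    # Stub for "furniture count and category prediction".
    count = rng.randint(2, 5)
    vocabulary = ["bed", "wardrobe", "nightstand", "desk"]
    state.categories = [rng.choice(vocabulary) for _ in range(count)]
    return state


def stage2_sizes(state, rng):
    # Stub for "object size refinement", conditioned on the categories.
    state.sizes = [
        tuple(rng.uniform(0.4, 2.0) for _ in range(3)) for _ in state.categories
    ]
    return state


def stage3_relations(state, rng, latent_dim=8):
    # Stub for "latent spatial relationship modeling": a random latent vector.
    state.relation_latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return state


def stage4_obbs(state, rng):
    # Stub for "oriented bounding box generation", conditioned on everything
    # above. Centers are kept inside the room rectangle as a toy constraint.
    width, depth = state.building["width"], state.building["depth"]
    state.obbs = [
        {
            "category": category,
            "center": (rng.uniform(0.0, width), rng.uniform(0.0, depth)),
            "size": size,
            "yaw": rng.uniform(0.0, 2.0 * math.pi),
        }
        for category, size in zip(state.categories, state.sizes)
    ]
    return state


def generate_scene(building, seed=0):
    """Run the four stubs in order; each stage sees all earlier outputs."""
    rng = random.Random(seed)
    state = SceneState(building=building)
    for stage in (stage1_categories, stage2_sizes, stage3_relations, stage4_obbs):
        state = stage(state, rng)
    return state


scene = generate_scene({"width": 4.0, "depth": 3.0})
print(len(scene.obbs), "objects placed")
```

Because each stage only consumes the shared state, any single stub could in principle be swapped for a different predictor (for example, an LLM proposing the category list) without touching the others, which is the flexibility the summary attributes to the decoupled design.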

📝 Abstract

Synthesizing realistic 3D indoor scenes remains challenging due to data scarcity and the difficulty of simultaneously enforcing global architectural constraints and local semantic consistency. Existing approaches often overlook structural boundaries or rely on fully connected relation graphs that introduce redundant generation errors. Inspired by human design cognition, we present CasLayout, a cascaded diffusion framework that decomposes the joint scene generation task into four conditional sub-stages with explicit physical and semantic roles: (1) predicting furniture quantity and categories, (2) refining object sizes and feature embeddings, (3) modeling spatial relationships in a latent space, and (4) generating Oriented Bounding Boxes (OBBs). This decoupled architecture reduces data requirements and enables flexible integration of Large Language Models (LLMs) and Vision Language Models (VLMs) for zero-shot tasks such as image-to-scene generation. To maintain physical validity within complex floor plans, we explicitly model building elements (e.g., walls, doors, and windows) as conditional constraints. Furthermore, to address the high entropy of dense relation graphs, we introduce a sparse relation graph formulation aligned with human spatial descriptions. By encoding these sparse graphs into a compact latent space using a bidirectional Variational Autoencoder (VAE), the proposed framework provides enhanced relational controllability, allowing generated layouts to better respect functional organization. Experiments demonstrate that CasLayout achieves state-of-the-art performance in fidelity and diversity while enabling improved controllability in practical applications.
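To make the contrast between dense and sparse relation graphs concrete, the sketch below counts edges under both formulations for a toy set of object centers. The abstract does not state how CasLayout selects its sparse edges, so the nearest-neighbor rule used here is purely an illustrative stand-in, and the function names `dense_edges` and `sparse_edges` are hypothetical.

```python
import math
from itertools import combinations


def dense_edges(n):
    """Fully connected graph: one undirected edge per object pair."""
    return list(combinations(range(n), 2))


def sparse_edges(centers):
    """Toy sparsification (an assumption, not the paper's rule):
    link each object only to its nearest neighbor."""
    edges = set()
    for i, center in enumerate(centers):
        nearest = min(
            (k for k in range(len(centers)) if k != i),
            key=lambda k: math.dist(center, centers[k]),
        )
        edges.add((min(i, nearest), max(i, nearest)))
    return sorted(edges)


# Five hypothetical object centers on a floor plan (arbitrary units).
centers = [(0, 0), (1, 0), (5, 5), (5, 6), (10, 0)]
print(len(dense_edges(len(centers))), "dense edges")
print(len(sparse_edges(centers)), "sparse edges")
```

The dense graph grows quadratically with the number of objects, while a sparse graph of this kind grows at most linearly, which illustrates why a sparse formulation leaves fewer relations for the generator to get wrong.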

Problem

Research questions and friction points this paper is trying to address.

3D indoor scene synthesis

data scarcity

global architectural constraints

local semantic consistency

relation modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cascaded Diffusion

Sparse Relation Graph

Implicit Relation Modeling