Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current image generation models often produce spatially inconsistent outputs and geometric distortions due to the lack of explicit scene-structure modeling. To address this, we propose a collaborative denoising framework that jointly generates images and their intrinsic attributes—namely depth maps and semantic segmentation masks—thereby implicitly learning geometric and layout constraints through a shared latent space. Our method builds upon a pre-trained latent diffusion model and employs a lightweight autoencoder to fuse multi-source intrinsic attributes as structural priors. A cross-domain information-sharing mechanism enables synchronized denoising across the image and attribute domains, requiring neither 3D supervision nor additional annotations. Experiments demonstrate that our approach preserves text-image alignment and visual fidelity while significantly improving spatial plausibility. It achieves state-of-the-art performance across quantitative metrics—including layout consistency and depth fidelity—outperforming all baseline methods.

📝 Abstract
Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).
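The abstract's aggregation step — stacking several intrinsic maps and compressing them into a single latent with an autoencoder — can be illustrated with a minimal sketch. All shapes, channel counts, and weights below are hypothetical stand-ins (the paper uses a learned lightweight autoencoder; here random linear maps just show the data flow):

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 8                                  # toy spatial resolution
depth = rng.random((H, W, 1))              # 1-channel depth map
seg = rng.random((H, W, 3))                # e.g. 3-channel segmentation encoding

# Stack all intrinsic properties channel-wise, then flatten per pixel.
x = np.concatenate([depth, seg], axis=-1).reshape(-1, 4)

d_latent = 2                               # compressed channel dimension
W_enc = rng.standard_normal((4, d_latent)) * 0.5   # stand-in encoder weights
W_dec = rng.standard_normal((d_latent, 4)) * 0.5   # stand-in decoder weights

z = x @ W_enc                              # single fused intrinsic latent
x_rec = z @ W_dec                          # decode back to all intrinsics

print(z.reshape(H, W, d_latent).shape)     # → (8, 8, 2)
```

The point is that a single latent variable carries every intrinsic channel at once, so the diffusion model only has to denoise one extra domain regardless of how many scene properties are used.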
Problem

Research questions and friction points this paper is trying to address.

Improving spatial consistency in image generation models
Co-generating images and intrinsic scene properties
Enhancing scene structure without degrading image quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-generate images and intrinsics for consistency
Aggregate intrinsics into latent via autoencoder
Denoise image and intrinsics domains jointly
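The joint-denoising idea above can be sketched as two latents stepped in lockstep with a cross-domain mixing term. This is a toy illustration, not the paper's method: the "denoiser" is a fixed linear map rather than a trained LDM U-Net, and the sharing weight `alpha` is an invented parameter:

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise_step(z, w):
    # Stand-in for one reverse-diffusion step: a fixed linear map
    # shrinks the latent; a real model would predict and remove noise.
    return z - 0.1 * (z @ w)

d = 4
w_img = np.eye(d) * 0.5                    # hypothetical image-branch dynamics
w_int = np.eye(d) * 0.5                    # hypothetical intrinsic-branch dynamics

z_img = rng.standard_normal((16, d))       # image latent
z_int = rng.standard_normal((16, d))       # fused intrinsic latent

alpha = 0.2                                # cross-domain sharing strength
for _ in range(50):
    z_img = denoise_step(z_img, w_img)
    z_int = denoise_step(z_int, w_int)
    # Cross-domain information sharing: each latent is softly pulled
    # toward the other so the two domains stay mutually consistent.
    z_img, z_int = ((1 - alpha) * z_img + alpha * z_int,
                    (1 - alpha) * z_int + alpha * z_img)

print(np.abs(z_img - z_int).max())         # difference shrinks toward 0
```

With the symmetric mixing above, the per-step difference between the two latents contracts by a factor of (1 − 2α), which is one simple way to see why sharing information during denoising drives the image and its intrinsics to agree.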