🤖 AI Summary
Existing layout-to-image generation methods do not explicitly model occlusion order (Z-order) when objects overlap, which often produces ambiguous textures and incorrect layering in the intersecting regions. This work addresses the limitation by explicitly incorporating Z-order into layout-guided image synthesis for the first time and introduces SA-Z, the first large-scale dataset with explicit occlusion ordering and pixel-level annotations. Building on this dataset, the authors propose OcclusionFormer, an occlusion-aware Diffusion Transformer framework that decouples object instances and composites them via volume rendering according to their Z-order priority. A queried alignment loss supervises individual instances to improve spatial precision and semantic consistency. The approach reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and improves structural accuracy across diverse scenes.
📝 Abstract
Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods have no explicit occlusion information, so generation in the intersection regions is inherently ambiguous and complex occlusion relationships are hard to resolve. As a result, they often produce entangled textures or physically inconsistent layering in the overlapping areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building on this dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
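To make the idea of Z-order compositing concrete, here is a minimal sketch of front-to-back compositing in the style of volume rendering, applied to a single pixel. It is not the authors' implementation: the function name, the scalar per-instance `values`, the per-instance opacities `alphas`, and the convention that a lower `z_order` rank means closer to the camera are all assumptions made for illustration. OcclusionFormer operates on learned instance features inside a Diffusion Transformer, which this sketch does not attempt to reproduce.

```python
from typing import Sequence


def composite_front_to_back(
    values: Sequence[float],
    alphas: Sequence[float],
    z_order: Sequence[int],
    background: float = 0.0,
) -> float:
    """Blend per-instance values at one pixel, nearest instance first.

    values[i]  : hypothetical scalar feature of instance i at this pixel
    alphas[i]  : hypothetical opacity of instance i in [0, 1]
    z_order[i] : assumed depth rank of instance i (lower = closer)
    """
    # Visit instances from the closest to the farthest.
    order = sorted(range(len(values)), key=lambda i: z_order[i])

    transmittance = 1.0  # fraction of the pixel not yet covered
    result = 0.0
    for i in order:
        # An instance contributes only through what is still visible.
        weight = transmittance * alphas[i]
        result += weight * values[i]
        transmittance *= 1.0 - alphas[i]

    # Whatever remains uncovered shows the background.
    return result + transmittance * background


# Two overlapping opaque instances: the one with the lower rank wins.
front_is_first = composite_front_to_back([1.0, 5.0], [1.0, 1.0], [0, 1])
front_is_second = composite_front_to_back([1.0, 5.0], [1.0, 1.0], [1, 0])
```

With fully opaque instances the overlapping pixel takes the value of whichever instance is ranked in front, so swapping the Z-order swaps the result. Without an ordering signal, nothing distinguishes the two cases, which is the ambiguity the abstract describes for overlapping bounding boxes.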