🤖 AI Summary
Existing multi-layer image generation methods struggle to ensure global layout coherence, physically plausible inter-layer interactions (e.g., shadows, reflections), and high-fidelity transparency at the same time. To address this, we propose PSDiffusion, the first end-to-end unified diffusion framework capable of jointly synthesizing an RGB background and multiple RGBA foreground layers in a single forward pass. Its core innovations are: (1) a global-layer interactive mechanism that replaces conventional sequential layer generation and post-hoc decomposition; (2) joint latent-space modeling across layers with cross-layer attention for spatial-visual alignment; and (3) a dual-path conditional control architecture that decouples layout and appearance guidance. Experiments demonstrate that PSDiffusion significantly improves alpha-matting accuracy and physical plausibility, achieving state-of-the-art performance on multi-layer compositing tasks (including complex occlusion, soft shadows, and transparent reflections) while preserving photorealistic fidelity.
📝 Abstract
Diffusion models have made remarkable advances in generating high-quality images from textual descriptions. Recent works such as LayerDiffuse have extended the previous single-layer, unified image generation paradigm to transparent image layer generation. However, existing multi-layer generation methods fail to handle the interactions among multiple layers, such as a reasonable global layout, physically plausible contacts, and visual effects like shadows and reflections, while maintaining high alpha quality. To solve this problem, we propose PSDiffusion, a unified diffusion framework for simultaneous multi-layer text-to-image generation. Our model automatically generates multi-layer images with one RGB background and multiple RGBA foregrounds through a single feed-forward process. Unlike existing methods that combine multiple tools for post-decomposition or generate layers sequentially and separately, our method introduces a global-layer interactive mechanism that generates layered images concurrently and collaboratively, ensuring not only high quality and completeness for each layer but also spatial and visual interactions among layers for global coherence.
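To make the output format concrete, the sketch below shows how one RGB background and several RGBA foregrounds can be flattened into a single image with the standard "over" compositing operator. This is not PSDiffusion's generation procedure, only an illustration of how such layers recombine. It assumes straight (non-premultiplied) alpha, channel values in [0, 1], and foregrounds listed back to front; the names `over` and `flatten` are hypothetical.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]
Image = List[List[RGB]]    # opaque image: rows of RGB pixels
Layer = List[List[RGBA]]   # transparent layer: rows of RGBA pixels


def over(fg: RGBA, bg: RGB) -> RGB:
    """Composite one straight-alpha RGBA pixel over an opaque RGB pixel."""
    r, g, b, a = fg
    return (
        a * r + (1.0 - a) * bg[0],
        a * g + (1.0 - a) * bg[1],
        a * b + (1.0 - a) * bg[2],
    )


def flatten(background: Image, foregrounds: List[Layer]) -> Image:
    """Composite foreground layers (assumed back to front) onto the background."""
    out = [list(row) for row in background]  # copy so the input is not mutated
    for layer in foregrounds:
        for y, row in enumerate(layer):
            for x, pixel in enumerate(row):
                out[y][x] = over(pixel, out[y][x])
    return out


# A 1x2 white background with two foreground layers.
background = [[(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]]
half_red = [[(1.0, 0.0, 0.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]      # semi-transparent red, then fully transparent
opaque_black = [[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0)]]  # fully transparent, then opaque black
composite = flatten(background, [half_red, opaque_black])
```

Under these assumptions, a soft shadow or reflection stored in a foreground layer is simply a region of partial alpha, which is why alpha quality matters for the final composite.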