🤖 AI Summary
Text-to-image diffusion models inherently produce flat, single-layer outputs, making professional-grade hierarchical editing infeasible. Existing approaches either require fine-tuning on large private datasets or generate isolated foregrounds without semantic coherence across full scenes. To address this, we propose a zero-shot hierarchical generation framework that operates entirely without training or auxiliary data. Our method jointly synthesizes foreground, background, and composite layers within the intermediate latent space of diffusion models via noise transplantation and cultivation. To our knowledge, this is the first approach to achieve semantically consistent, fully layered scene generation under zero-shot conditions while preserving structural coherence across layers. Quantitative and qualitative evaluations demonstrate that our method matches fine-tuned baselines in image fidelity and inter-layer consistency while significantly enhancing controllability and practicality for downstream tasks such as complex compositional editing.
📝 Abstract
Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.
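The abstract describes NTC only at a high level. As a rough illustration of the transplantation step, the sketch below runs a composite pass, harvests an intermediate latent via a step-end callback, re-noises it to the top of the noise schedule, and blends it into the initial noise of a foreground pass. It uses the Hugging Face diffusers library; the capture step `K`, blend weight `ALPHA`, re-noising strategy, prompts, and checkpoint are illustrative assumptions, not the authors' actual procedure.

```python
# Minimal sketch of the noise-transplantation idea (NOT the paper's exact
# method): capture an intermediate latent from the composite pass and seed
# the foreground pass with it. K, ALPHA, and the re-noising step are assumed.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

K = 10        # denoising step at which to harvest the composite latent (assumed)
ALPHA = 0.6   # blend weight between transplanted structure and fresh noise (assumed)
captured = {}

def capture(pipeline, step, timestep, callback_kwargs):
    # Stash the intermediate latent of the composite pass at step K.
    if step == K:
        captured["latent"] = callback_kwargs["latents"].detach().clone()
    return callback_kwargs

gen = torch.Generator(device).manual_seed(0)
shape = (1, pipe.unet.config.in_channels, 64, 64)  # 512x512 output
init_noise = torch.randn(shape, generator=gen, device=device, dtype=dtype)

# 1) Composite pass: generate the full scene, capturing an intermediate latent.
composite = pipe(
    "a cat sitting on a sofa in a cozy living room",
    latents=init_noise,
    callback_on_step_end=capture,
).images[0]

# 2) Transplant: push the captured latent back to the start of the noise
#    schedule with scheduler.add_noise, then blend it with fresh noise so the
#    next pass starts from (approximately) the right noise level.
fresh = torch.randn(shape, generator=gen, device=device, dtype=dtype)
t_max = torch.tensor([pipe.scheduler.config.num_train_timesteps - 1], device=device)
renoised = pipe.scheduler.add_noise(captured["latent"], fresh, t_max)
transplanted = ALPHA * renoised + (1 - ALPHA) * fresh  # simple blend; not variance-preserving

# 3) Foreground pass seeded with the transplanted noise, so the subject's
#    layout stays aligned with the composite scene.
foreground = pipe(
    "a cat, isolated subject, plain background",
    latents=transplanted,
).images[0]
```

In this reading, `ALPHA` trades off structural alignment with the composite scene against the diversity of the per-layer generations; the paper's "cultivation" of transplanted latents across foreground, background, and composite layers is presumably more involved than this single blend.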