🤖 AI Summary
This work identifies that the initial noise in text-to-image (T2I) diffusion models is not purely stochastic but inherently encodes interpretable semantic structure. To exploit this property, we propose a training-free, architecture-agnostic two-stage noise modulation framework: first, semantic components within the noise are identified via information-theoretic analysis; second, explicit control is achieved through semantic erasure and targeted re-injection. Grounded in a theoretical equivalence between the diffusion process and semantic injection, our approach requires neither fine-tuning nor model retraining. Extensive evaluation across diverse backbone architectures, including DiT and U-Net, demonstrates substantial improvements in inter-step consistency and text–image alignment fidelity. The method establishes a novel paradigm for controllable generation in diffusion models, enabling precise, semantics-aware noise manipulation without architectural or training modifications.
📝 Abstract
In text-driven content generation (T2C) diffusion models, the semantics of the generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically regarded as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that strong, analyzable patterns lie beneath the random surface of the noise. Specifically, we first conduct a comprehensive analysis of the impact of random noise on the model's generation. We find that the noise not only contains rich semantic information, but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory. Moreover, the equivalence between the generation process of a diffusion model and semantic injection can be used to inject semantics into the cleaned noise. We then formalize these observations mathematically and propose a simple but efficient, training-free, and universal two-step "Semantic Erasure-Injection" process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures. It presents a novel perspective for optimizing generation in diffusion models and provides a universal tool for consistent generation.