🤖 AI Summary
This work addresses the limited controllability of global structure and color in existing text-to-image diffusion models, which typically initialize generation from white Gaussian noise. The study shows that low-frequency components of the input noise predominantly govern the overall layout and tonal distribution of the generated image. Building on this insight, the authors propose a training-free, computationally lightweight frequency-domain guidance method: during inference, low-frequency image priors are injected directly into the low-frequency portion of the noise. This approach steers global image attributes while preserving diversity in high-frequency detail, making conditional image generation more controllable.
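The injection step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the radial `cutoff` of 0.05 cycles per pixel, the hard (binary) frequency mask, and the assumption that the prior is already in the same space and scale as the noise are all hypothetical choices made for illustration.

```python
import numpy as np

def low_pass_mask(height, width, cutoff):
    """Boolean mask of 2-D FFT bins whose radial frequency is <= cutoff (cycles/pixel)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx**2 + fy**2) <= cutoff

def inject_low_frequencies(noise, prior, cutoff=0.05):
    """Replace the low-frequency band of `noise` with the same band of `prior`.

    Both inputs are 2-D real arrays of identical shape. The prior is assumed
    (hypothetically) to be pre-scaled to the noise's value range.
    """
    mask = low_pass_mask(*noise.shape, cutoff)
    noise_f = np.fft.fft2(noise)
    prior_f = np.fft.fft2(prior)
    # Low-frequency bins come from the prior, all other bins from the noise.
    mixed_f = np.where(mask, prior_f, noise_f)
    # The mask is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifft2(mixed_f).real

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
# Toy prior: a smooth horizontal gradient standing in for a low-frequency image cue.
prior = np.tile(np.linspace(-1.0, 1.0, 64), (64, 1))
guided_noise = inject_low_frequencies(noise, prior)
```

In this sketch the high-frequency bins of `guided_noise` are exactly those of the original noise, which corresponds to the summary's point that fine detail remains free to vary across samples.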
📝 Abstract
Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the image's global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.
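The statement that all frequencies in white Gaussian noise carry comparable statistical energy can be checked numerically. The sketch below is an illustrative check rather than anything from the paper: the sample count, image size, band cutoff, and the function name `band_mean_power` are assumptions. It averages the power spectrum of many noise samples and compares the mean power per frequency bin inside and outside a low-frequency band.

```python
import numpy as np

def band_mean_power(num_samples=200, size=32, cutoff=0.1, seed=0):
    """Mean per-bin spectral power of white Gaussian noise in the low and high bands."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    low = np.sqrt(fx**2 + fy**2) <= cutoff  # hypothetical low-frequency band

    noise = rng.standard_normal((num_samples, size, size))
    # With norm="ortho", unit-variance white noise has expected power 1 in every bin.
    power = np.abs(np.fft.fft2(noise, norm="ortho")) ** 2
    mean_power = power.mean(axis=0)  # average over samples, per frequency bin
    return mean_power[low].mean(), mean_power[~low].mean()

low_power, high_power = band_mean_power()
print(f"low band: {low_power:.3f}, high band: {high_power:.3f}")
```

Both values should come out close to 1, so the disproportionate influence of the low-frequency band on structure and color reported in the abstract is not explained by that band simply containing more energy.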