🤖 AI Summary
Fixed Classifier-Free Guidance (CFG) scales force a trade-off between image quality and text alignment in text-to-image diffusion models. To address this, the work proposes an annealing-based dynamic guidance mechanism that introduces no additional parameters or memory overhead. The method learns an adaptive scheduling policy from the conditional noise signal, adjusting the CFG scale at each denoising step to jointly optimize generation stability and semantic fidelity. Experiments on multiple benchmarks demonstrate significant improvements in quantitative metrics, including FID and CLIP Score, while preserving inference efficiency. The approach also substantially reduces CFG's sensitivity to hyperparameter tuning, yielding a more robust and efficient sampling-guidance paradigm for controllable image generation.
📝 Abstract
Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness relies heavily on careful guidance during the sampling process. Classifier-Free Guidance (CFG) is a widely used mechanism for steering generation via a guidance scale that balances image quality against prompt alignment. However, the choice of guidance scale critically affects convergence toward a visually appealing, prompt-adherent image. In this work, we propose an annealing guidance scheduler that dynamically adjusts the guidance scale over time based on the conditional noise signal. By learning a scheduling policy, our method tames the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, the scheduler requires no additional activations or memory consumption, and can seamlessly replace standard classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
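To make the mechanism concrete, here is a minimal sketch of a CFG step with a time-varying guidance scale. The `cfg_step` formula is the standard classifier-free guidance extrapolation; the `annealed_scale` schedule (a linear decay between assumed bounds `w_max` and `w_min`) is purely illustrative, not the learned policy the paper proposes:

```python
import numpy as np

def cfg_step(eps_cond, eps_uncond, w):
    """Standard CFG: extrapolate from the unconditional noise
    prediction toward the conditional one with scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def annealed_scale(t, T, w_max=7.5, w_min=1.0):
    """Illustrative hand-crafted schedule: strong guidance at the
    noisy start of sampling (t = T), weak at the end (t = 0).
    The paper learns this policy from the conditional noise signal."""
    return w_min + (w_max - w_min) * (t / T)

# Toy usage with stand-in noise predictions.
T = 50
eps_c = np.ones(4)   # stand-in conditional prediction
eps_u = np.zeros(4)  # stand-in unconditional prediction
for t in (T, T // 2, 0):
    w = annealed_scale(t, T)
    eps = cfg_step(eps_c, eps_u, w)
```

With `w = 1` the step reduces to the plain conditional prediction; larger `w` pushes the sample further toward the prompt, which is exactly the quality/alignment dial the scheduler controls over time.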