Dynamic Classifier-Free Diffusion Guidance via Online Feedback

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing classifier-free guidance (CFG) in text-to-image diffusion models employs static guidance scales, which adapt poorly across prompts and generalize badly. To address this, the authors propose a dynamic CFG framework that greedily selects the CFG scale at each reverse-diffusion step based on multi-dimensional latent-space evaluations (CLIP alignment, discriminator-based fidelity, and human preference rewards), yielding prompt- and sample-specific guidance schedules. The work introduces the first online feedback mechanism into CFG scheduling, overcoming the conventional "one-size-fits-all" limitation. Evaluation on both small-scale models and the state-of-the-art Imagen 3 demonstrates significant improvements in text-image alignment, visual quality, and text rendering accuracy. Human preference studies report up to a 53.8% win rate over the default Imagen 3 baseline for overall preference, rising to 55.5% on prompts targeting specific capabilities such as text rendering, confirming the effectiveness of the adaptive guidance strategy.
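The greedy per-step search described above can be sketched in pure Python with toy stand-ins for the denoiser and the feedback scorer. Everything here is illustrative, not from the paper: the function names, the fixed 0.1 update coefficient (real samplers use scheduler-specific coefficients), and the candidate scale grid are all assumptions.

```python
def cfg_step(x, t, denoise_cond, denoise_uncond, scale):
    """One toy reverse-diffusion update with classifier-free guidance:
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_u = denoise_uncond(x, t)
    eps_c = denoise_cond(x, t)
    eps = eps_u + scale * (eps_c - eps_u)
    return x - 0.1 * eps  # toy update; real samplers use scheduler coefficients


def greedy_dynamic_cfg(x, timesteps, denoise_cond, denoise_uncond,
                       score_fn, candidate_scales=(1.0, 3.0, 5.0, 7.5, 10.0)):
    """At each timestep, try every candidate CFG scale, score the resulting
    latent with the online feedback function, and greedily keep the best.
    Returns the final latent and the per-timestep scale schedule."""
    schedule = []
    for t in timesteps:
        best_scale, best_x, best_score = None, None, float("-inf")
        for s in candidate_scales:
            x_next = cfg_step(x, t, denoise_cond, denoise_uncond, s)
            score = score_fn(x_next, t)
            if score > best_score:
                best_scale, best_x, best_score = s, x_next, score
        x = best_x
        schedule.append(best_scale)
    return x, schedule
```

Because the schedule is chosen step by step from live scores rather than fixed in advance, each prompt and sample gets its own guidance trajectory, which is the core idea the paper argues for.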

📝 Abstract
Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This "one-size-fits-all" approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challenge this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluations, such as CLIP for alignment, a discriminator for fidelity, and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to a 53.8% human preference win-rate for overall preference, a figure that increases up to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.
Problem

Research questions and friction points this paper is trying to address.

Dynamic CFG scheduling for text-to-image diffusion models
Adaptive guidance scales based on online feedback evaluation
Optimizing text alignment and visual quality per prompt
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic CFG scheduling via online feedback
Greedy search for optimal per-timestep scale
Multi-metric evaluation including CLIP and discriminator
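The multi-metric evaluation in the last bullet amounts to collapsing several feedback signals into one scalar that the greedy search can rank candidates by. A minimal sketch, assuming a simple weighted-sum aggregation (the interface and the weighted sum are illustrative; the paper does not specify this exact combination rule):

```python
def combined_feedback(latent, prompt, scorers, weights):
    """Aggregate several online feedback signals (e.g. CLIP alignment,
    discriminator fidelity, a preference reward) into one scalar.
    Hypothetical interface: `scorers` maps a metric name to a callable
    (latent, prompt) -> float; `weights` maps the same names to floats."""
    return sum(weights[name] * fn(latent, prompt)
               for name, fn in scorers.items())
```

In practice the relative weights would control the trade-off between alignment, fidelity, and preference when the per-timestep scale is selected.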