🤖 AI Summary
Classifier-Free Guidance (CFG) improves generation quality and prompt alignment in conditional diffusion models but severely compromises sample diversity, leading to a fundamental quality-diversity trade-off. This work first shows that CFG does not correspond to a rigorously defined denoising diffusion process, revealing the absence of a critical Rényi divergence correction term. Building on this theoretical insight, the authors propose a theoretically consistent, diversity-preserving CFG-enhanced sampling framework: a Gibbs-like iterative reweighting sampler. Unlike standard CFG, this method explicitly accounts for the missing divergence correction while maintaining computational efficiency. Extensive experiments across image and text-to-audio generation demonstrate consistent superiority over baseline CFG across quality and diversity metrics, thereby reconciling high-fidelity generation with rich sample diversity.
📝 Abstract
Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, we show that CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w > 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at https://github.com/yazidjanati/cfgig
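The linear combination of conditional and unconditional denoiser outputs that the abstract refers to can be sketched as follows. This is a minimal illustration of standard CFG in noise-prediction form, not the authors' corrected sampler; the arrays stand in for actual denoiser outputs:

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: linearly combine the
    conditional and unconditional noise predictions.
    w = 1 recovers the purely conditional model; w > 1 strengthens
    guidance (the regime the paper analyzes)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-in denoiser outputs for a 2-dimensional latent.
eps_c = np.array([0.5, -0.2])
eps_u = np.array([0.1, 0.3])

guided = cfg_combine(eps_c, eps_u, w=3.0)
```

With $w = 1$ the combination reduces exactly to the conditional prediction; the paper's point is that for $w > 1$ the resulting score no longer corresponds to a well-defined DDM without the Rényi divergence correction.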