AI Summary
Existing conditional diffusion models suffer from a disconnect between the theory and practice of their guidance mechanisms, which limits generation performance. This work first reveals the theoretical inefficacy of conventional guidance objectives, such as Classifier-Free Guidance (CFG), under unconstrained settings where no lookahead constraints are imposed. To bridge this gap, we propose Rectified Gradient Guidance (REG), a novel gradient-correction-based guidance paradigm. REG reweights the joint distribution and applies a theoretically grounded gradient correction to approximate the unconstrained optimal solution, while remaining plug-and-play compatible with existing guidance methods. Evaluated on ImageNet class-conditional generation and multi-scale text-to-image synthesis, REG consistently improves FID (by 2.1 to 4.3), Inception Score (by 0.8 to 1.9), and CLIP Score (by 0.07 to 0.12). Our approach advances both the theoretical foundations and practical efficacy of diffusion guidance, unifying theory and practice.
Abstract
Guidance techniques are simple yet effective for improving conditional generation in diffusion models. Despite their empirical success, the practical implementation of guidance diverges significantly from its theoretical motivation. In this paper, we reconcile this discrepancy by replacing the scaled marginal distribution target, which we prove theoretically invalid, with a valid scaled joint distribution objective. Additionally, we show that established guidance implementations are approximations to the intractable optimal solution under no future foresight constraints. Building on these theoretical insights, we propose rectified gradient guidance (REG), a versatile enhancement designed to boost the performance of existing guidance methods. Experiments in 1D and 2D settings demonstrate that REG provides a better approximation to the optimal solution than prior guidance techniques, validating the proposed theoretical framework. Extensive experiments on class-conditional ImageNet and text-to-image generation tasks show that incorporating REG consistently improves FID and Inception/CLIP scores across various settings.
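As background for the guidance implementations the abstract refers to, the standard classifier-free guidance update extrapolates from an unconditional noise prediction toward a conditional one. Below is a minimal sketch of that combination; the function name and toy values are illustrative and not from the paper, and REG's gradient correction would further adjust the guided direction rather than use this rule as-is.

```python
def cfg_noise_prediction(eps_uncond, eps_cond, w):
    """Classifier-free guidance combination:
        eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    w = 0 recovers the unconditional model, w = 1 the conditional
    model, and w > 1 amplifies the conditioning signal."""
    return [eu + w * (ec - eu) for eu, ec in zip(eps_uncond, eps_cond)]

# Toy noise predictions standing in for a denoiser's outputs.
eps_uncond = [0.1, -0.2, 0.3]
eps_cond = [0.2, -0.1, 0.5]

guided = cfg_noise_prediction(eps_uncond, eps_cond, w=2.0)
print(guided)  # approximately [0.3, 0.0, 0.7]
```

In practice this combination is applied to the model's noise (or score) prediction at every sampling step; the paper's point is that the common choice of a scaled marginal target motivating it is theoretically invalid, and that REG better approximates the unconstrained optimum.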