🤖 AI Summary
This work addresses the limited theoretical understanding of classifier-free guidance (CFG) in diffusion models. It analyzes CFG in a linearized diffusion model and uses a contrastive principal component (CPC) decomposition to separate linear CFG into three components: (i) a mean-shift term that steers samples approximately toward class means, (ii) a positive CPC term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. Experiments on real-world nonlinear diffusion models show that linear CFG resembles its nonlinear counterpart over a broad range of noise levels, although the two diverge at low noise levels. The analysis offers an interpretable, feature-level account of how CFG improves generation quality.
📝 Abstract
Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show that its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on CFG's mechanism in the nonlinear regime.
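
The three-term split described above can be illustrated with a toy sketch. It is not the paper's implementation and rests on several assumptions the abstract does not state: both the class-conditional and unconditional data are Gaussian with diagonal covariances (so the principal directions are just the coordinate axes), the denoiser is the optimal linear one for that Gaussian, and guidance is applied as `D_c + gamma * (D_c - D_u)`, which is one common convention. The function names, `gamma`, and all numbers are hypothetical.

```python
def gaussian_denoiser(x, mu, var, sigma):
    """Optimal denoiser for data ~ N(mu, diag(var)) observed with noise std sigma."""
    s2 = sigma ** 2
    return [m + v / (v + s2) * (xi - m) for xi, m, v in zip(x, mu, var)]


def linear_cfg_terms(x, mu_c, var_c, mu_u, var_u, sigma):
    """Split the guidance direction D_c(x) - D_u(x) into three per-coordinate terms."""
    s2 = sigma ** 2
    mean_shift, pos_cpc, neg_cpc = [], [], []
    for xi, mc, vc, mu, vu in zip(x, mu_c, var_c, mu_u, var_u):
        # Mean-shift term: proportional to (class mean - unconditional mean).
        mean_shift.append(s2 / (vu + s2) * (mc - mu))
        # Contrastive weight: positive where the class has MORE variance than
        # the unconditional data, negative where it has less.
        w = s2 * (1.0 / (vu + s2) - 1.0 / (vc + s2))
        contrib = w * (xi - mc)
        pos_cpc.append(contrib if w > 0 else 0.0)  # amplifies class-specific directions
        neg_cpc.append(contrib if w < 0 else 0.0)  # damps directions dominated by generic variance
    return mean_shift, pos_cpc, neg_cpc


# Hypothetical two-dimensional example.
x = [1.0, -0.5]
mu_c, var_c = [2.0, 0.0], [4.0, 0.5]   # class-conditional statistics (assumed)
mu_u, var_u = [0.0, 0.0], [1.0, 2.0]   # unconditional statistics (assumed)
sigma = 1.0                            # noise level
gamma = 3.0                            # guidance strength (assumed convention)

mean_shift, pos_cpc, neg_cpc = linear_cfg_terms(x, mu_c, var_c, mu_u, var_u, sigma)
d_c = gaussian_denoiser(x, mu_c, var_c, sigma)
guided = [dc + gamma * (a + b + c)
          for dc, a, b, c in zip(d_c, mean_shift, pos_cpc, neg_cpc)]
print(mean_shift, pos_cpc, neg_cpc, guided)
```

In this toy setting the three terms sum exactly to `D_c(x) - D_u(x)`, so scaling each one separately shows how much of the guided output comes from the mean shift versus the two contrastive terms. With full (non-diagonal) covariances, the per-coordinate weight would be replaced by an eigendecomposition of the corresponding matrix difference.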