🤖 AI Summary
Classifier-Free Guidance (CFG) suffers from entangled geometric and semantic representations in complex compositional tasks because it relies on a semantically vacuous null prompt, which limits generation fidelity. This work proposes Condition-Degradation Guidance (CDG), which replaces the null prompt with a partially degraded condition, shifting the guidance paradigm from "good versus empty" to "good versus nearly good" to sharpen control precision. By analyzing the functional differentiation between content and context tokens in Transformer-based text encoders, CDG selectively degrades content tokens to construct adaptive negative samples, yielding a plug-and-play method that requires no additional models or training. Evaluated across architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG consistently improves compositional generation accuracy and image-text alignment with negligible computational overhead.
📝 Abstract
Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}_{\text{deg}}$. This reframes guidance from a coarse "good vs. null" contrast to a more refined "good vs. almost good" discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}_{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: constructing adaptive, semantically aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.
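The guidance reformulation described above can be sketched numerically. Standard CFG predicts $\epsilon_\varnothing + w(\epsilon_c - \epsilon_\varnothing)$; CDG instead contrasts against a prediction conditioned on $\boldsymbol{c}_{\text{deg}}$, built by perturbing only content-token embeddings. The toy sketch below is an illustrative assumption, not the paper's exact procedure: the `content_mask` selection, the noise-mixing `strength`, and both function names are hypothetical stand-ins for whatever degradation the method actually applies.

```python
import numpy as np

def degrade_condition(cond_emb, content_mask, strength=0.5, rng=None):
    """Build a degraded condition c_deg (illustrative sketch).

    cond_emb:     (seq_len, dim) prompt embedding from the text encoder
    content_mask: (seq_len,) bool, True for content tokens; how these are
                  identified is an assumption, not taken from the paper
    strength:     how strongly content tokens are mixed with noise
    """
    rng = np.random.default_rng(0) if rng is None else rng
    deg = cond_emb.copy()
    noise = rng.standard_normal(cond_emb.shape)
    # Perturb only content tokens; context-aggregating tokens are left intact,
    # so c_deg stays "almost good" rather than semantically empty.
    deg[content_mask] = (1 - strength) * deg[content_mask] + strength * noise[content_mask]
    return deg

def guided_prediction(eps_cond, eps_deg, w=7.5):
    """CDG-style update: the null-prompt prediction of CFG is replaced by the
    degraded-condition prediction, giving a 'good vs. almost good' contrast."""
    return eps_deg + w * (eps_cond - eps_deg)
```

With `w = 1` the update reduces to the plain conditional prediction, and larger `w` extrapolates away from the degraded condition, mirroring how CFG extrapolates away from the null prompt.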