DICE: Distilling Classifier-Free Guidance into Text Embeddings

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models often exhibit weak prompt alignment, and the standard remedy, classifier-free guidance (CFG), introduces high computational cost and theoretical inconsistency. This paper proposes a CFG-free generation paradigm that, for the first time, fully distills CFG's semantic guidance capability into the text embedding space. Through knowledge distillation, the method refines the text embeddings to explicitly model CFG's guidance directions, and it is compatible with mainstream architectures, including Stable Diffusion v1.5, SDXL, and PixArt-α. At inference, it eliminates the redundant unconditional branch, restoring theoretical consistency to the diffusion process. Experiments demonstrate that the approach matches CFG's prompt-alignment quality across multiple models, accelerates sampling by approximately 2×, and natively supports negative-prompt editing, thereby significantly improving both generation efficiency and image fidelity.

📝 Abstract
Text-to-image diffusion models are capable of generating high-quality images, but these images often fail to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, using CFG introduces significant computational overhead and deviates from the established theoretical foundations of diffusion models. In this paper, we present DIstilling CFG by enhancing text Embeddings (DICE), a novel approach that removes the reliance on CFG in the generative process while maintaining the benefits it provides. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational and theoretical drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL, and PixArt-α demonstrate the effectiveness of our method. Furthermore, DICE supports negative prompts for image editing to improve image quality further. Code will be available soon.
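The core idea in the abstract can be illustrated with a toy sketch: standard CFG needs two denoiser passes per step (conditional and unconditional) and extrapolates between them, while the distilled model makes a single pass with a refined text embedding that reproduces the CFG direction. Everything below is hypothetical, not the paper's implementation: `eps_model` is a stand-in linear "denoiser", and the closed-form refined embedding only matches CFG exactly because this toy model is linear in the embedding; in DICE the refined embedding is learned by distillation.

```python
import numpy as np

def eps_model(x, emb):
    # Toy stand-in for a diffusion denoiser eps(x_t, c): linear in both
    # the noisy sample x and the text embedding (illustration only).
    return 0.1 * x + emb

def cfg_step(x, cond_emb, uncond_emb, w=7.5):
    # Classifier-free guidance: TWO forward passes per sampling step,
    # extrapolating from the unconditional toward the conditional output.
    e_c = eps_model(x, cond_emb)
    e_u = eps_model(x, uncond_emb)
    return e_u + w * (e_c - e_u)

def cfg_free_step(x, refined_emb):
    # DICE-style inference: ONE forward pass, with a text embedding
    # refined to replicate the CFG-based direction.
    return eps_model(x, refined_emb)

x = np.ones(4)                    # dummy noisy latent
cond = np.full(4, 0.5)            # dummy conditional embedding
uncond = np.zeros(4)              # dummy unconditional embedding
w = 7.5

# For this linear toy denoiser, the embedding that exactly mimics CFG is
# uncond + w * (cond - uncond); in practice it must be learned.
refined = uncond + w * (cond - uncond)
print(np.allclose(cfg_step(x, cond, uncond, w), cfg_free_step(x, refined)))
```

This also makes the claimed ~2× speedup plausible: per step, the dominant cost is the denoiser call, and the CFG-free path issues one call instead of two.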
Problem

Research questions and friction points this paper is trying to address.

Improves text-image alignment
Reduces computational overhead
Maintains CFG benefits without CFG
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills CFG into embeddings
Enhances text-image alignment
Reduces computational overhead
Zhenyu Zhou
Zhejiang University
Defang Chen
University at Buffalo, SUNY
Can Wang
Zhejiang University
Chun Chen
Zhejiang University
Siwei Lyu
University at Buffalo