Not all tokens contribute equally to diffusion learning

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of existing conditional diffusion models to interference from tokens of low semantic importance under classifier-free guidance (CFG), which often leads to biased or semantically incomplete generations. To mitigate this issue, the authors propose the DARE framework, which introduces Distribution-Rectified Classifier-Free Guidance (DR-CFG) to dynamically suppress low-semantic-density tokens and incorporates a Spatial Representation Alignment (SRA) mechanism to strengthen the spatial guidance exerted by high-semantic-density tokens. By jointly optimizing attention allocation and the training objective through distribution debiasing and spatial consistency, DARE significantly improves generation fidelity and semantic alignment across multiple benchmarks, outperforming current state-of-the-art methods.
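The summary describes DR-CFG as dynamically suppressing low-semantic-density tokens under classifier-free guidance. A minimal sketch of what such token-weighted guidance could look like at sampling time is shown below, assuming a hypothetical per-token `semantic_density` score; the paper's rectification actually operates on the training process, so this is an illustration of the idea, not the authors' implementation.

```python
# Illustrative sketch: classifier-free guidance with per-token reweighting,
# in the spirit of DR-CFG. `semantic_density` is a hypothetical score in
# [0, 1] per text token (higher = more semantically important); the model
# signature model(x_t, t, emb) is also an assumption.
import torch

def token_weighted_cfg(model, x_t, t, cond_emb, uncond_emb,
                       semantic_density, guidance_scale=7.5):
    """One denoising step of CFG where low-semantic-density tokens
    contribute less to the conditional branch.

    cond_emb:         (batch, tokens, dim) text-token embeddings
    semantic_density: (batch, tokens) per-token importance weights
    """
    # Down-weight embeddings of low-density tokens before conditioning
    # (one simple choice of rectification; not the paper's exact scheme).
    rectified = cond_emb * semantic_density.unsqueeze(-1)

    eps_cond = model(x_t, t, rectified)
    eps_uncond = model(x_t, t, uncond_emb)

    # Standard CFG combination of conditional and unconditional predictions.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```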
📝 Abstract
With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low-semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high-semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.
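On the SRA side, the abstract describes adaptively reweighting cross-attention maps by token importance so that semantically important tokens exert stronger spatial guidance. Below is an illustrative sketch of one such reweighting, assuming a hypothetical per-token `importance` score and a simple renormalization; the paper's mechanism additionally enforces representation consistency, which is not modeled here.

```python
# Illustrative sketch: cross-attention reweighted by token importance,
# loosely following the SRA idea. `importance` is a hypothetical
# nonnegative per-token score; the rescale-and-renormalize step below
# is one simple choice, not the authors' exact formulation.
import torch
import torch.nn.functional as F

def reweighted_cross_attention(q, k, v, importance):
    """
    q:          (batch, queries, dim) spatial (latent) features
    k, v:       (batch, tokens, dim)  text-token keys and values
    importance: (batch, tokens)       per-token importance scores
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, queries, tokens)
    attn = F.softmax(logits, dim=-1)

    # Shift attention mass toward important tokens, then renormalize so
    # each query's weights still sum to one.
    attn = attn * importance.unsqueeze(1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn @ v                               # (batch, queries, dim)
```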
Problem

Research questions and friction points this paper is trying to address.

conditional diffusion models
text-to-video generation
classifier-free guidance
semantic tokens
cross-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
classifier-free guidance
semantic alignment
cross-attention
distribution debiasing
Authors
Guoqing Zhang
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China
Lu Shi
Postdoc, Tsinghua University
Robotics · Control · Data-Driven · Koopman Operator
Wanru Xu
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China
Linna Zhang
School of Mechanical Engineering, Guizhou University, Guizhou, China
Sen Wang
Seed, ByteDance, Beijing, China
Fangfang Wang
Zhejiang University
computer vision · machine learning
Yigang Cen
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China