Not all tokens contribute equally to diffusion learning

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the susceptibility of existing conditional diffusion models to interference from tokens of low semantic importance under classifier-free guidance (CFG), which often leads to biased or semantically incomplete generations. To mitigate this issue, the authors propose the DARE framework, which introduces Distribution-Rectified Classifier-Free Guidance (DR-CFG) to dynamically suppress low-semantic-density tokens and incorporates a Spatial Representation Alignment (SRA) mechanism to strengthen the spatial guidance exerted by high-semantic-density tokens. By jointly optimizing attention allocation and the training objective through distribution debiasing and spatial consistency, DARE significantly improves generation fidelity and semantic alignment across multiple benchmarks, outperforming current state-of-the-art methods.
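The summary describes DR-CFG as dynamically suppressing low-semantic-density tokens under classifier-free guidance. A minimal sketch of what such token-weighted guidance could look like at sampling time is shown below, assuming a hypothetical per-token `semantic_density` score; the paper's rectification actually operates on the training process, so this is an illustration of the idea, not the authors' implementation.

```python
# Illustrative sketch: classifier-free guidance with per-token reweighting,
# in the spirit of DR-CFG. `semantic_density` is a hypothetical score in
# [0, 1] per text token (higher = more semantically important); the model
# signature model(x_t, t, emb) is also an assumption.
import torch

def token_weighted_cfg(model, x_t, t, cond_emb, uncond_emb,
                       semantic_density, guidance_scale=7.5):
    """One denoising step of CFG where low-semantic-density tokens
    contribute less to the conditional branch.

    cond_emb:         (batch, tokens, dim) text-token embeddings
    semantic_density: (batch, tokens) per-token importance weights
    """
    # Down-weight embeddings of low-density tokens before conditioning
    # (one simple choice of rectification; not the paper's exact scheme).
    rectified = cond_emb * semantic_density.unsqueeze(-1)

    eps_cond = model(x_t, t, rectified)
    eps_uncond = model(x_t, t, uncond_emb)

    # Standard CFG combination of conditional and unconditional predictions.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```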
📝 Abstract
With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low-semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high-semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.
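On the SRA side, the abstract describes adaptively reweighting cross-attention maps by token importance so that semantically important tokens exert stronger spatial guidance. Below is an illustrative sketch of one such reweighting, assuming a hypothetical per-token `importance` score and a simple renormalization; the paper's mechanism additionally enforces representation consistency, which is not modeled here.

```python
# Illustrative sketch: cross-attention reweighted by token importance,
# loosely following the SRA idea. `importance` is a hypothetical
# nonnegative per-token score; the rescale-and-renormalize step below
# is one simple choice, not the authors' exact formulation.
import torch
import torch.nn.functional as F

def reweighted_cross_attention(q, k, v, importance):
    """
    q:          (batch, queries, dim) spatial (latent) features
    k, v:       (batch, tokens, dim)  text-token keys and values
    importance: (batch, tokens)       per-token importance scores
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, queries, tokens)
    attn = F.softmax(logits, dim=-1)

    # Shift attention mass toward important tokens, then renormalize so
    # each query's weights still sum to one.
    attn = attn * importance.unsqueeze(1)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return attn @ v                               # (batch, queries, dim)
```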
Problem

Research questions and friction points this paper is trying to address.

conditional diffusion models
text-to-video generation
classifier-free guidance
semantic tokens
cross-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion models
classifier-free guidance
semantic alignment
cross-attention
distribution debiasing
Authors
Guoqing Zhang
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China
Lu Shi
Postdoc, Tsinghua University
Robotics · Control · Data-Driven · Koopman Operator
Wanru Xu
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China
Linna Zhang
School of Mechanical Engineering, Guizhou University, Guizhou, China
Sen Wang
Seed, ByteDance, Beijing, China
Fangfang Wang
Zhejiang University
computer vision · machine learning
Yigang Cen
State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing, China; School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Visual Intelligence +X International Cooperation Joint Laboratory of MOE, Beijing Jiaotong University, Beijing, China