Attention Sinks in Diffusion Transformers: A Causal Analysis

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates the role of attention sinks, tokens that receive a disproportionate share of attention mass, in diffusion Transformers for text-to-image generation. Using a training-free intervention, the authors dynamically identify the dominant attention recipients at each denoising timestep and suppress them on both the score and value paths to analyze their causal impact. Experiments use 553 GenEval prompts on Stable Diffusion 3, with SDXL as corroboration, and measure text-image alignment with CLIP-T and preference proxies with ImageReward and HPS-v2. Suppressing the sinks at k = 1 does not degrade any of these metrics; only under stronger interventions (k ≥ 10) does HPS-v2 show a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts caused by suppression are nonetheless sink-specific, roughly 6× larger than those from equal-budget random masking, which points to an empirical dissociation between trajectory-level perturbation and semantic alignment.

📝 Abstract

Attention sinks (tokens that receive disproportionate attention mass) are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion 3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at k = 1; only under stronger interventions (k ≥ 10) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless sink-specific, about 6× larger than equal-budget random masking, revealing an empirical dissociation between trajectory-level perturbation and semantic alignment in diffusion transformers. Code available at https://github.com/wfz666/ICML26-attention-sink.
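
To make the intervention described in the abstract more concrete, the following is a minimal NumPy sketch of one way to pick the top-k attention recipients in a single attention head and suppress them on both the score and value paths. It is an illustration under stated assumptions, not the authors' implementation: the function names (find_sinks, attention_with_suppression, random_indices), the single-head layout, and the choice of masking scores to negative infinity and zeroing value rows are all hypothetical. How the paper actually pairs the two interventions is not specified on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def find_sinks(q, k, top_k=1):
    """Indices of the top_k keys that receive the most attention mass.

    q: (n_queries, d), k: (n_keys, d). Hypothetical helper.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_queries, n_keys)
    incoming = attn.sum(axis=0)            # total mass received per key
    return np.argsort(incoming)[::-1][:top_k]

def attention_with_suppression(q, k, v, suppressed):
    """Single-head attention with the given key indices suppressed.

    Assumed form of the intervention: logits of suppressed keys are set
    to -inf (score path) and their value rows are zeroed (value path).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores[:, suppressed] = -np.inf        # score path
    v = v.copy()
    v[suppressed] = 0.0                    # value path
    return softmax(scores) @ v

def random_indices(n_keys, budget, rng):
    """Equal-budget random baseline: mask `budget` keys chosen uniformly."""
    return rng.choice(n_keys, size=budget, replace=False)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))
    k = rng.standard_normal((32, 8))
    v = rng.standard_normal((32, 8))
    sinks = find_sinks(q, k, top_k=1)      # recomputed at every timestep
    out_sink = attention_with_suppression(q, k, v, sinks)
    out_rand = attention_with_suppression(q, k, v, random_indices(32, 1, rng))
    print(sinks, out_sink.shape, out_rand.shape)
```

In this toy version the value-path zeroing is redundant once the scores are masked, since a suppressed key already gets zero weight; it is kept only to mirror the "score and value paths" wording. In a real diffusion transformer the selection would be repeated per layer, head, and denoising timestep, which is what makes the identification dynamic.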

Problem

Research questions and friction points this paper is trying to address.

attention sinks

diffusion transformers

text-to-image generation

semantic alignment

causal analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention sinks

diffusion transformers

causal intervention