Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although Linear Diffusion Transformers (LiTs) reduce the computational complexity of self-attention, they often suffer from degraded generation quality due to overly smoothed attention weights. To address this issue, this work proposes Dynamic Differential Linear Attention (DyDiLA), which incorporates a dynamic projection module to decouple token representations, employs a dynamic measure kernel to enhance semantic discriminability, and introduces a token differential operator to improve the robustness of query-key matching. The resulting DyDi-LiT model effectively mitigates the over-smoothing problem while maintaining computational efficiency, achieving state-of-the-art performance across multiple high-fidelity image generation benchmarks and improving both representational capacity and generative quality.
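
For context, here is a minimal sketch of the kernelized linear attention that LiTs build on. The `elu + 1` feature map and the tensor layout are common illustrative choices, not taken from this paper; the uniformly non-negative scores such kernels produce are one source of the over-smoothed attention weights DyDiLA targets.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(N*d^2) rather than softmax's O(N^2*d).

    q, k, v: (batch, heads, seq_len, head_dim). The elu+1 feature map is a
    common choice, not necessarily the one used in DyDi-LiT.
    """
    q = F.elu(q) + 1.0                                   # phi(q) >= 0
    k = F.elu(k) + 1.0                                   # phi(k) >= 0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```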

📝 Abstract
Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformers (LiTs) often sacrifice generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the over-smoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) a dynamic projection module, which decouples token representations by learning with dynamically assigned knowledge; (ii) a dynamic measure kernel, which provides a better similarity measure that captures fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) a token differential operator, which enables more robust query-to-key retrieval by computing the differences between tokens and the corresponding information redundancy produced by the dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates these advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.
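
The page carries no code, so the following is one possible reading of how the three designs could compose. The class name, the softmax routing, the two-kernel pool, and the mean-of-kernel-output standing in for "information redundancy" are all assumptions inferred from the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DyDiLASketch(nn.Module):
    """Hypothetical reading of DyDiLA's three designs (not official code).

    (i)   dynamic projection: token-conditioned mixing of learned projections;
    (ii)  dynamic measure kernel: soft, per-token choice among feature maps;
    (iii) token differential operator: tokens minus a redundancy term exposed
          by the kernel output, applied before the linear-attention product.
    """

    def __init__(self, dim, num_proj=4):
        super().__init__()
        self.projs = nn.Parameter(torch.randn(num_proj, dim, dim) * dim ** -0.5)
        self.route = nn.Linear(dim, num_proj)   # per-token projection routing
        self.kernel_gate = nn.Linear(dim, 2)    # per-token kernel routing

    def dynamic_projection(self, x):                            # x: (B, N, D)
        w = self.route(x).softmax(dim=-1)                       # (B, N, P)
        xp = torch.einsum("bnd,pde->bnpe", x, self.projs)       # (B, N, P, D)
        return torch.einsum("bnpe,bnp->bne", xp, w)

    def dynamic_measure_kernel(self, x):
        g = self.kernel_gate(x).softmax(dim=-1)                 # (B, N, 2)
        feats = torch.stack([F.elu(x) + 1.0, F.relu(x)], dim=2) # (B, N, 2, D)
        return torch.einsum("bnkd,bnk->bnd", feats, g)

    def forward(self, q, k, v, eps=1e-6):
        q = self.dynamic_projection(q)
        k = self.dynamic_projection(k)
        phi_q = self.dynamic_measure_kernel(q)
        phi_k = self.dynamic_measure_kernel(k)
        # Token differential operator: the per-sequence mean of the kernel
        # output stands in for "information redundancy" (an assumption).
        dq = phi_q - phi_q.mean(dim=1, keepdim=True)
        dk = phi_k - phi_k.mean(dim=1, keepdim=True)
        kv = torch.einsum("bnd,bne->bde", dk, v)        # sum_n dk_n v_n^T
        # Differential features can go negative, so normalize by |.| + eps.
        z = 1.0 / (torch.einsum("bnd,bd->bn", dq, dk.sum(dim=1)).abs() + eps)
        return torch.einsum("bnd,bde,bn->bne", dq, kv, z)
```

Usage would mirror any attention drop-in, e.g. `DyDiLASketch(dim=64)(q, k, v)` with `q, k, v` of shape `(B, N, 64)`; attention heads are omitted for brevity, and the abs-based normalization is a sketch choice rather than the paper's.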
Problem

Research questions and friction points this paper is trying to address.

linear attention
over-smoothed attention
diffusion transformers
image generation
scalability bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Differential Linear Attention
Linear Diffusion Transformer
Dynamic Projection
Dynamic Measure Kernel
Token Differential Operator
Authors

Boyuan Cao
Fudan University

Xingbo Yao
Hong Kong University of Science and Technology (Guangzhou)

Chenhui Wang
PhD Candidate, Fudan University
AI for Neuroscience, Computer Vision

Jiaxin Ye
Fudan University

Yujie Wei
Fudan University

Hongming Shan
Fudan University; Rensselaer Polytechnic Institute
Machine Learning, Medical Imaging, Computer Vision