RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion models suffer from high inference costs, and existing sparsification approaches predominantly focus on weight pruning, which often degrades generation quality. This work observes that activations in Diffusion Transformers (DiTs) are inherently sparse and notably more robust to N:M semi-structured sparsity than weights. Building on this insight, the authors propose applying N:M sparsity to activations rather than weights, complemented by an error-compensation mechanism and highly optimized CUDA kernels. The resulting method preserves the original generation quality across multiple diffusion models while achieving an average speedup of up to 1.55× in linear layers, shifting the sparsification paradigm from weights to activations.

📝 Abstract

Diffusion Transformers (DiTs) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of up to 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
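
To make the N:M pattern concrete, here is a minimal pure-Python sketch of 2:4 sparsification applied to one row of activations. It is an illustration only: the function name `nm_sparsify` is hypothetical, and choosing which entries to keep by magnitude is an assumption (a common choice for N:M pruning), not a rule the abstract states. It does not reproduce the paper's error-compensation techniques or CUDA kernels.

```python
def nm_sparsify(row, n=2, m=4):
    """Keep the n largest-magnitude entries in each group of m; zero the rest.

    Hypothetical helper for illustration. Magnitude-based selection is an
    assumption, not necessarily the criterion used by RT-Lynx.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n largest |x| in this group (ties go to the earlier position).
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(x if i in keep else 0.0 for i, x in enumerate(group))
    return out


# Toy activation row: two groups of four values.
activations = [0.1, -2.0, 0.03, 1.5, 0.0, 0.2, -0.7, 0.05]
sparse = nm_sparsify(activations)
print(sparse)  # [0.0, -2.0, 0.0, 1.5, 0.0, 0.2, -0.7, 0.0]
```

Every group of four now holds at most two nonzero values. Because that structure is fixed, a matrix multiply can skip half of the products for the sparsified operand, which is where the "nearly halve FLOPs" figure comes from.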

Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers

inference cost

semi-structured sparsity

activation sparsity

generation quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation sparsification

N:M sparsity

Diffusion Transformers