DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

📅 2025-03-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high computational cost of attention in Multimodal Diffusion Transformers (MMDiT) for text-to-image generation, this paper proposes a post-training attention compression method. It makes three contributions: (1) head-wise dynamic pruning combined with arrow attention, designed for MMDiT's cross-modal attention architecture, which differs from that of standard DiT; (2) a locality-aware, metric-driven dynamic caching strategy integrated with custom fused CUDA kernels, which reduces compression search time to minutes; and (3) joint optimization of quantization and sparsity. Experiments show a 68% reduction in attention FLOPs and a 1.5× end-to-end speedup for 2K-resolution image generation, with no perceptible loss of visual quality.

📝 Abstract

Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.

Problem

Research questions and friction points this paper is trying to address.

Reduces computational bottlenecks in MMDiT attention mechanisms

Accelerates attention via head-wise compression and caching

Maintains image quality while cutting FLOPs and boosting speed

Innovation

Methods, ideas, or system contributions that make the work stand out.

Head-wise arrow attention compression

Efficient Fused Kernel acceleration

Dynamic attention head adjustment
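
The summary above names arrow attention without defining it. As a rough illustration only, the sketch below assumes one plausible reading: text tokens come first in the joint sequence and keep full attention (all rows and columns that touch text), while image-to-image attention is limited to a local window, which gives the mask an arrow-like shape. The token ordering, the `window` parameter, and the function names `arrow_mask` and `density` are all assumptions made for this example, not details taken from the paper.

```python
def arrow_mask(n_text, n_image, window):
    """Build a boolean attention mask (True = keep this query/key pair).

    Assumed layout (hypothetical, for illustration): the first n_text
    positions are text tokens, the remaining n_image are image tokens.
    """
    n = n_text + n_image
    mask = [[False] * n for _ in range(n)]
    for i in range(n):          # query index
        for j in range(n):      # key index
            touches_text = i < n_text or j < n_text   # full text rows/columns
            local = abs(i - j) <= window              # banded image-image part
            mask[i][j] = touches_text or local
    return mask


def density(mask):
    """Fraction of query/key pairs that the mask keeps."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


mask = arrow_mask(n_text=2, n_image=6, window=1)
for row in mask:
    print("".join("#" if keep else "." for keep in row))
print(f"kept fraction: {density(mask):.4f}")
```

In a head-wise scheme like the one described, a mask of this kind would presumably be applied only to heads judged tolerant of it, with the other heads left dense or served from a cache. The kept fraction gives a quick feel for how much attention work such a mask could skip at a given window size.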

🔎 Similar Papers

Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models