DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

📅 2025-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video diffusion Transformers (DiTs) achieve high generation quality but suffer from prohibitively slow inference: attention computation accounts for over 80% of latency, and generating an 8-second 720p video takes tens of minutes. To address this, we propose a training-free, dynamic structured sparse attention framework: we construct a low-resolution draft attention map in the compressed latent space and theoretically prove that it is a high-fidelity approximation of the full-resolution attention; this draft then guides hardware-efficient sparse attention computation at full resolution. Our method integrates cross-frame downsampling, Q/K/V reordering, and GPU-optimized sparse kernels. Without any fine-tuning, it significantly accelerates end-to-end inference, achieving up to a 1.75× speedup and substantially reducing generation time for 8-second 720p videos, while outperforming existing sparse baselines in visual quality.

📝 Abstract
Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck: attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes, posing serious challenges to practical application and scalability. To address this, we propose DraftAttention, a training-free framework for accelerating video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over a latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from the draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide sparse attention computation at full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to a 1.75x end-to-end speedup on GPUs. Code: https://github.com/shawnricecake/draft-attention
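The core idea in the abstract can be illustrated with a minimal sketch: average-pool the query and key tokens into blocks, score block pairs with a low-resolution "draft" attention map, and keep only the top-scoring block pairs as a structured sparsity mask for the full-resolution attention. The function name, block size, and keep ratio below are hypothetical illustrations, not the authors' implementation (which uses GPU-optimized sparse kernels and Q/K/V reordering not shown here).

```python
import numpy as np

def draft_sparse_attention_mask(q, k, block=4, keep_ratio=0.5):
    """Hypothetical sketch of draft-attention guidance: pool Q/K into
    token blocks, rank block pairs by a low-resolution attention score,
    and return a block-sparse mask for full-resolution attention."""
    n, d = q.shape
    nb = n // block
    # Low-resolution draft queries/keys via average pooling over token blocks.
    q_draft = q[: nb * block].reshape(nb, block, d).mean(axis=1)
    k_draft = k[: nb * block].reshape(nb, block, d).mean(axis=1)
    # Draft attention scores between block pairs (softmax omitted,
    # since only the relative ranking matters for block selection).
    scores = q_draft @ k_draft.T / np.sqrt(d)
    # Keep the highest-scoring key blocks for each query block.
    keep = max(1, int(np.ceil(keep_ratio * nb)))
    top = np.argsort(scores, axis=1)[:, -keep:]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask  # nb x nb block mask guiding sparse attention
```

With `keep_ratio=0.5`, each query block attends to only half of the key blocks at full resolution, which is where the structured-sparsity speedup comes from; the block granularity is what makes the pattern friendly to hardware-efficient kernels.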
Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost in video diffusion transformers
Accelerating attention computation via low-resolution guidance
Improving video generation efficiency without quality loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses low-resolution draft attention guidance
Applies dynamic sparse attention on GPUs
Reorders query, key, value for sparsity