🤖 AI Summary
Video diffusion Transformers suffer from prohibitive self-attention cost due to extremely long sequences (e.g., over 30,000 token embeddings for a 5-second video), making dense attention a major latency bottleneck. Existing block-sparse approaches skip computation only at a coarse granularity (e.g., entire 64×64 score blocks), underutilizing the structural sparsity of the attention map and limiting acceleration. To address this, we propose FG-Attn, a fine-grained sparse attention mechanism: (1) it introduces M×1 query-key slice-level skipping (the first of its kind), bypassing the rigid constraints of conventional block sparsity; and (2) it integrates an asynchronous gather-load operation with tiled key-value packing to optimize sparse KV memory access and shared-memory computation. Evaluated on a single H100 GPU, FG-Attn achieves average speedups of 1.55× and 1.41× for generating 5-second 480p and 720p videos, respectively, with peak acceleration of up to 1.65×, substantially improving video generation efficiency.
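As a toy illustration of the granularity difference, the sketch below (Python/PyTorch, not taken from the paper) counts how many keys survive an M×M block-level skip versus an M×1 slice-level skip for one block of queries. The magnitude threshold `tau`, the random "hot key" pattern, and the keep/skip rule are illustrative assumptions, not the paper's actual selection criterion.

```python
import torch

M, N, tau = 64, 1024, 1e-3                 # query-block size, #keys, illustrative skip threshold
torch.manual_seed(0)
scores = torch.zeros(M, N)                 # stand-in for |attention scores| of one query block
hot = torch.randperm(N)[: N // 8]          # assume only 1/8 of keys matter to this block
scores[:, hot] = torch.rand(M, hot.numel())

# Block-sparse (M x M): a key block is skipped only if *every* score in the M x M tile is negligible.
block_keep = scores.reshape(M, N // M, M).amax(dim=(0, 2)) >= tau   # one flag per 64-key block

# Fine-grained (M x 1): each key column is kept or skipped on its own for this query block.
slice_keep = scores.amax(dim=0) >= tau                              # one flag per key

print(f"key blocks kept: {block_keep.float().mean().item():.1%}   "
      f"key slices kept: {slice_keep.float().mean().item():.1%}")
# With this toy pattern, nearly all 64-key blocks are kept, while only ~12.5% of key slices are.
```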
📝 Abstract
Generating realistic videos with diffusion transformers demands significant computation, with attention layers as the central bottleneck; even producing a short clip requires running a transformer over a very long sequence of embeddings, e.g., more than 30K embeddings for a 5-second video, incurring significant latency. Prior work aims to mitigate this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, these works typically rely on block-sparse attention, which skips score computation only when all entries in a block of attention scores (corresponding to M queries and M keys, typically with M = 64) are zero. This coarse-grained skipping of attention scores does not fully exploit the sparsity in the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that leverages sparsity at a fine granularity. Unlike block-sparse attention, which skips entire M×M blocks, our approach skips computation at the granularity of M×1 slices of the attention map. Each slice is produced by the dot products between a block of query vectors and a single key. To implement the proposed sparse attention mechanism, we develop a new efficient bulk-load operation, the asynchronous-gather load, which gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU's shared memory. When computing attention for a block of queries, only the sparse set of keys relevant to those queries is loaded into shared memory, in contrast to block-sparse attention, which loads full blocks of key tokens. Applied to video diffusion models, our fine-grained sparse attention achieves an average 1.55× speedup (up to 1.65×) for 5-second, 480p videos and an average 1.41× speedup (up to 1.49×) for 5-second, 720p videos on a single H100 GPU.
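As a rough sketch of the reference semantics (not the paper's GPU kernel), the PyTorch snippet below computes attention for one query block over only a packed tile of gathered key/value rows. The asynchronous-gather load and shared-memory packing are emulated here with ordinary fancy indexing; the function name `fg_attn_query_block` and the placeholder `relevant_idx` selection step are illustrative assumptions.

```python
import math
import torch

def fg_attn_query_block(q_blk, K, V, relevant_idx):
    """q_blk: (M, d) query block; K, V: (N, d) keys/values; relevant_idx: (S,) kept-key indices."""
    K_tile = K[relevant_idx]                        # gather sparse keys into a packed tile
    V_tile = V[relevant_idx]                        # gather the matching values
    scores = (q_blk @ K_tile.T) / math.sqrt(q_blk.shape[-1])
    probs = torch.softmax(scores, dim=-1)           # softmax over kept keys only;
                                                    # skipped keys are assumed to contribute ~0
    return probs @ V_tile                           # (M, d) output for this query block

M, N, d = 64, 32768, 128                            # ~30K keys, roughly a 5-second video
q_blk, K, V = torch.randn(M, d), torch.randn(N, d), torch.randn(N, d)
relevant_idx = torch.randperm(N)[: N // 8]          # placeholder for the real key-selection step
out = fg_attn_query_block(q_blk, K, V, relevant_idx)
print(out.shape)                                    # torch.Size([64, 128])
```

The sketch only captures which key-value pairs participate in the computation; in the actual kernel, the gather would overlap memory traffic with computation and keep the packed tile resident in shared memory.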