FG-Attn: Leveraging Fine-Grained Sparsity In Diffusion Transformers

📅 2025-09-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion transformers suffer from prohibitive self-attention overhead due to extremely long sequences (e.g., over 30,000 token embeddings for a 5-second video), making dense attention a dominant latency cost. Existing block-sparse approaches skip computation only at a coarse granularity (e.g., entire 64×64 score blocks), underutilizing the sparsity structure and limiting acceleration. To address this, the paper proposes FG-Attn, a fine-grained sparse attention mechanism that (1) skips computation at the granularity of M×1 query-key slices rather than rigid M×M blocks, and (2) introduces an asynchronous-gather load with tiled key-value packing to arrange sparse KV vectors efficiently in GPU shared memory. Evaluated on a single H100 GPU, FG-Attn achieves average speedups of 1.55× and 1.41× for generating 5-second 480p and 720p videos, respectively, with peak acceleration up to 1.65×.

📝 Abstract
Generating realistic videos with diffusion transformers demands significant computation, with attention layers as the central bottleneck; even producing a short clip requires running a transformer over a very long sequence of embeddings, e.g., more than 30K embeddings for a 5-second video, incurring significant latency. Prior work aims to mitigate this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, these works typically rely on block-sparse attention, which skips score computation only when all entries in a block of attention scores (corresponding to M queries and M keys, with M = 64 typically) are zero. This coarse-grained skipping of attention scores does not fully exploit sparsity in the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that leverages sparsity at a fine granularity. Unlike block-sparse attention, which skips entire M×M blocks, our approach skips computations at the granularity of M×1 slices of the attention map, where each slice is produced by dot products between a block of M query vectors and a single key. To implement this mechanism, we develop a new efficient bulk-load operation called the asynchronous-gather load, which gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU's shared memory. When computing attention for a block of queries, only the sparse set of keys relevant to those queries is loaded into shared memory, in contrast to loading full blocks of key tokens in block-sparse attention. Applied to video diffusion models, our fine-grained sparse attention achieves an average 1.55× (up to 1.65×) speedup for 5-second 480p videos and an average 1.41× (up to 1.49×) speedup for 5-second 720p videos on a single H100 GPU.
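The per-query-block computation the abstract describes can be sketched in NumPy. This is a minimal reference sketch, not the paper's GPU kernel: the function name and the precomputed `keep_idx` set of relevant keys are assumptions for illustration (the paper determines which M×1 slices to skip inside the kernel).

```python
import numpy as np

def fg_attn_block(Q_blk, K, V, keep_idx):
    """Attention for one block of M queries over a sparse set of keys.

    Q_blk:    (M, d) block of query vectors
    K, V:     (N, d) full key/value matrices
    keep_idx: indices of keys whose M x 1 score slice is kept
              (assumed precomputed here; hypothetical interface)
    """
    K_s, V_s = K[keep_idx], V[keep_idx]               # gather only relevant KV rows
    scores = Q_blk @ K_s.T / np.sqrt(Q_blk.shape[1])  # only the kept M x 1 slices
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return w @ V_s
```

With `keep_idx` covering all keys, this reduces to dense attention; the savings come from skipping the dot products (and KV loads) for keys outside `keep_idx`, per query block rather than per M×M block.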
Problem

Research questions and friction points this paper is trying to address.

Reducing computational bottlenecks in diffusion transformers for video generation
Overcoming the limited sparsity exploitation of coarse-grained block-sparse attention
Accelerating long-sequence attention computation in video diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained sparse attention mechanism for diffusion transformers
Asynchronous-gather load operation for efficient key-value retrieval
Skipping computations at M×1 slice granularity in attention maps
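The asynchronous-gather load can be illustrated on the CPU side: scattered key/value rows are gathered into dense, tile-aligned buffers so downstream matrix multiplies see packed operands. This is a hypothetical NumPy analogue; the function name and tile size are illustrative, and the real operation runs asynchronously into GPU shared memory.

```python
import numpy as np

def gather_pack_kv(K, V, keep_idx, tile=8):
    """Pack a sparse set of key/value rows into dense, tile-aligned buffers,
    mimicking (in NumPy) what an asynchronous-gather load does in GPU shared
    memory: scattered rows become contiguous tiles for dense matmul units.

    Returns padded buffers plus the count of valid rows before padding.
    """
    n = len(keep_idx)
    padded = ((n + tile - 1) // tile) * tile      # round up to whole tiles
    Kp = np.zeros((padded, K.shape[1]), K.dtype)  # zero-pad the final tile
    Vp = np.zeros((padded, V.shape[1]), V.dtype)
    Kp[:n] = K[keep_idx]                          # gather scattered rows
    Vp[:n] = V[keep_idx]
    return Kp, Vp, n
```

Zero-padding the last partial tile lets the attention kernel process fixed-size tiles uniformly; padded rows contribute nothing if masked out of the softmax.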
Sankeerth Durvasula
Department of Computer Science, University of Toronto, Toronto, Canada
Kavya Sreedhar
Google, Mountain View, USA
Zain Moustafa
Department of Computer Science, University of Toronto, Toronto, Canada
Suraj Kothawade
Google (Machine Learning and Computer Vision)
Ashish Gondimalla
Google, Sunnyvale, USA
Suvinay Subramanian
Google (Computer Systems)
Narges Shahidi
Google, Mountain View, USA
Nandita Vijaykumar
Assistant Professor, University of Toronto (Computer Systems and Architecture)