Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

📅 2025-05-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers suffer from prohibitively slow inference and high computational cost when generating high-resolution images and videos (e.g., over 1 hour on an A100 for 8192×8192 image synthesis). To address this, we propose GRAT, a training-free attention acceleration method for pretrained Diffusion Transformers. GRAT is the first to exploit the intrinsic local sparsity of attention maps in such models, introducing token grouping with shared attention computation and a structured key-value region constraint that combines local block masking with cross-shaped cropping. These mechanisms are integrated into a GPU-optimized, non-overlapping parallel computation scheme. Without any fine-tuning or retraining, GRAT preserves full-attention generation quality while achieving a 35.8× speedup for 8192×8192 image synthesis. We validate its generality and effectiveness on two state-of-the-art Diffusion Transformer frameworks: Flux and HunyuanVideo.

📝 Abstract
Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an 8192×8192 image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a 35.8× speedup over full attention when generating 8192×8192 images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
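The grouping-and-shared-attention idea described above can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not the paper's GPU kernel: it assumes a 1-D token sequence, a group size `group_size`, and a shared KV window covering each group's own block plus `halo` neighboring blocks on each side (the "surrounding blocks" pattern); all names and parameters are illustrative.

```python
# Hedged sketch of group-shared, block-local attention (not the official GRAT
# implementation). All query tokens in a group attend to one shared KV region.
import numpy as np

def grouped_local_attention(q, k, v, group_size=4, halo=1):
    """q, k, v: (seq_len, dim) arrays; seq_len must be divisible by group_size."""
    n, d = q.shape
    assert n % group_size == 0, "sequence length must be divisible by group size"
    out = np.empty_like(q)
    for g in range(n // group_size):
        qs = slice(g * group_size, (g + 1) * group_size)
        # Shared KV region for the whole group: its own block plus `halo`
        # blocks on each side, clipped to the sequence boundaries.
        lo = max(0, (g - halo) * group_size)
        hi = min(n, (g + 1 + halo) * group_size)
        scores = q[qs] @ k[lo:hi].T / np.sqrt(d)        # (group_size, region)
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)              # row-wise softmax
        out[qs] = w @ v[lo:hi]                          # attend only to region
    return out
```

Because every query in a group reads the same contiguous KV slice, the per-group work maps naturally onto a GPU thread block, which is the parallelism argument the abstract makes; with `halo` large enough to cover the whole sequence, the sketch reduces to full attention.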
Problem

Research questions and friction points this paper is trying to address.

Reduce high computational costs in Diffusion Transformers
Accelerate attention without compromising output quality
Improve GPU parallelism for faster image and video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grouping tokens for efficient GPU execution
Shared key-value tokens within groups
Structured regions reduce computational overhead
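The two structured KV regions named in the paper (surrounding blocks and criss-cross regions) can be pictured as boolean masks over a 2-D grid of token groups. The sketch below is illustrative only; the function names and mask layout are assumptions, not the paper's API.

```python
# Hedged sketch of the two structured KV-region masks over an h×w grid of
# token groups; mask[i, j] marks which groups the group at (i, j) may attend to.
import numpy as np

def surrounding_block_mask(h, w, halo=1):
    """Each group attends to the (2*halo+1)^2 blocks centered on itself."""
    mask = np.zeros((h, w, h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            mask[i, j,
                 max(0, i - halo):i + halo + 1,
                 max(0, j - halo):j + halo + 1] = True
    return mask

def criss_cross_mask(h, w):
    """Each group attends to every group sharing its row or column."""
    mask = np.zeros((h, w, h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            mask[i, j, i, :] = True  # whole row
            mask[i, j, :, j] = True  # whole column
    return mask
```

A surrounding-block region keeps attention cost constant per group, while a criss-cross region grows only linearly with grid side length, versus quadratic for full attention; both are far sparser than attending to all h×w groups.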