AI Summary
This work addresses the substantial computational overhead of high-resolution Transformer-based diffusion models, which stems from the quadratic complexity of self-attention and the iterative generation process. The authors propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator that combines Cached Token Reuse (CTR), which eliminates redundant token computation, with Softmax Thresholding with Sparsity Mask Reuse (ST), which induces sparsity in attention. DiSC further employs a hash-based distribution scheme over on-chip memory banks and a unified dataflow, so the resulting hybrid dense-sparse workload runs on its existing compute engines without dedicated sparsity-handling hardware. Evaluated on DiT and PixArt-Sigma models, DiSC achieves 2.48-4.74x speedup over NVIDIA A100 and H100 GPUs while reducing energy consumption by 46.4% to 68.1%.
Abstract
Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to their iterative generation process and the quadratic complexity of self-attention at high resolutions. In this paper, we propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator. At the software level, DiSC introduces two algorithms: Cached Token Reuse (CTR) and Softmax Thresholding with Sparsity Mask Reuse (ST). CTR translates spatial variations in the input latent difference across steps into a token-level reuse decision, effectively eliminating redundant token computation. ST induces sparsity in attention operations by reusing a previously generated sparsity pattern, leveraging temporal similarity across steps to bypass costly prediction overhead. Together, these algorithms provide resolution-scalable computational benefits and yield a moderately sparse, hybrid dense-sparse workload.
To exploit this efficiently, we design a specialized hardware architecture and unified dataflow. This architecture avoids dedicated sparsity-handling components; instead, a hash-based distribution over on-chip memory banks allows DiSC to reuse its existing compute engines for sparse operations, efficiently exploiting the induced sparsity with minimal hardware overhead. Evaluated on DiT and PixArt-Sigma, DiSC achieves 3.47-4.74x and 2.48-3.50x speedups over NVIDIA A100 and H100 GPUs, respectively, with energy savings ranging from 46.4% to 68.1%.
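As a rough illustration of the two software-level ideas described above, the following minimal Python sketch shows one way a token-level reuse decision and a reused softmax-threshold mask could look. It is not the authors' implementation. The function names, the mean-absolute-difference metric, and the `threshold` and `tau` parameters are all assumptions made for this example; the abstract does not specify them.

```python
import math

def ctr_reuse_mask(latent_prev, latent_curr, threshold):
    """Hypothetical CTR-style decision: True means 'reuse the cached token'.

    Assumes the per-token change is measured as the mean absolute
    difference between consecutive denoising steps.
    """
    mask = []
    for prev, curr in zip(latent_prev, latent_curr):
        diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        mask.append(diff < threshold)
    return mask

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_attention(q, k, v, keep=None, tau=0.05):
    """Hypothetical ST-style attention.

    keep=None: dense pass; also builds a mask of key indices whose
               softmax probability is at least tau (assumed rule).
    keep=mask: reuse pass; scores are computed only for the kept keys.
    Returns (output rows, mask).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out, mask = [], []
    for i, q_row in enumerate(q):
        cols = list(range(len(k))) if keep is None else keep[i]
        scores = [scale * sum(a * b for a, b in zip(q_row, k[j])) for j in cols]
        probs = softmax(scores)
        out.append([sum(p * v[j][c] for p, j in zip(probs, cols))
                    for c in range(len(v[0]))])
        if keep is None:
            kept = [j for j, p in zip(cols, probs) if p >= tau]
            # Guard: keep the strongest key so no row ends up empty.
            mask.append(kept or [cols[probs.index(max(probs))]])
        else:
            mask.append(cols)
    return out, mask

# Usage: build the mask at one step, then reuse it at the next step.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
dense_out, mask = sparse_attention(q, k, v, tau=0.4)
reuse_out, _ = sparse_attention(q, k, v, keep=mask)
reuse = ctr_reuse_mask([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.01], [2.0, 2.0]], 0.1)
```

In this sketch the reuse pass skips every score outside the stored mask, so the saving comes from not recomputing the pattern at each step. How DiSC actually schedules mask generation versus reuse is not stated in the abstract.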