DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the substantial computational overhead of high-resolution Transformer-based diffusion models, which stems from the quadratic complexity of self-attention and from the iterative generation process. The authors propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator that combines two algorithms: Cached Token Reuse (CTR), which eliminates redundant token computation, and Softmax Thresholding with Sparsity Mask Reuse (ST), which induces attention sparsity. DiSC further employs a hash-based distribution over on-chip memory banks and a unified dataflow, so that its existing compute engines can process the resulting hybrid dense-sparse workload without dedicated sparsity-handling hardware. Evaluated on DiT and PixArt-Sigma, DiSC achieves 3.47–4.74× speedup over the NVIDIA A100 and 2.48–3.50× over the H100, with energy savings of 46.4%–68.1%.

📝 Abstract

Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to their iterative generation process and the quadratic complexity of self-attention at high resolutions. In this paper, we propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator. At the software level, DiSC introduces two algorithms: Cached Token Reuse (CTR) and Softmax Thresholding with Sparsity Mask Reuse (ST). CTR translates spatial variations in the input latent difference across steps into a token-level reuse decision, effectively eliminating redundant token computation. ST induces sparsity in attention operations by reusing a previously generated sparsity pattern, leveraging temporal similarity to bypass costly prediction overhead. Together, these algorithms provide resolution-scalable computational benefits and yield a moderately sparse, hybrid dense-sparse workload. To exploit this workload efficiently, we design a specialized hardware architecture and unified dataflow. This architecture avoids dedicated sparsity-handling components; instead, a hash-based distribution over on-chip memory banks allows DiSC to reuse its existing compute engines for sparse operations, efficiently exploiting the induced sparsity with minimal hardware overhead. Evaluated on DiT and PixArt-Sigma, DiSC achieves 3.47-4.74x and 2.48-3.50x speedups over NVIDIA A100 and H100 GPUs, respectively, with energy savings ranging from 46.4% to 68.1%.
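To make the two software-level ideas concrete, here is a minimal sketch of what a token-level reuse decision and a reusable softmax-threshold mask might look like. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean-absolute-difference metric, and both threshold parameters are hypothetical, since the abstract does not specify how the latent difference is measured or how thresholds are chosen.

```python
import math

def ctr_reuse_mask(latent_prev, latent_curr, threshold):
    """Hypothetical CTR-style decision: True means reuse the cached token output.

    Assumes the per-token change is measured as a mean absolute difference
    between the latents of two consecutive denoising steps.
    """
    mask = []
    for prev_tok, curr_tok in zip(latent_prev, latent_curr):
        diff = sum(abs(c - p) for p, c in zip(prev_tok, curr_tok)) / len(curr_tok)
        mask.append(diff < threshold)
    return mask

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def st_sparsity_mask(scores, threshold):
    """Hypothetical ST-style mask: keep entries whose softmax probability
    reaches the threshold. Computed once, then reused at later steps."""
    return [[p >= threshold for p in softmax(row)] for row in scores]

def masked_attention_probs(scores, mask):
    """Apply a previously generated mask to new scores.

    Softmax is taken over the kept entries only; pruned entries get 0.0.
    A row with no kept entries yields all zeros.
    """
    out = []
    for row, keep in zip(scores, mask):
        kept = [s for s, k in zip(row, keep) if k]
        if not kept:
            out.append([0.0] * len(row))
            continue
        probs = iter(softmax(kept))
        out.append([next(probs) if k else 0.0 for k in keep])
    return out

# Token 1 changed noticeably between steps, so only it is recomputed.
reuse = ctr_reuse_mask(
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.01], [1.5, 1.0], [2.0, 2.0]],
    threshold=0.05,
)

# The mask is generated at one step and applied to the next step's scores.
mask = st_sparsity_mask([[4.0, 0.0, 0.0], [0.0, 0.0, 0.0]], threshold=0.1)
probs = masked_attention_probs([[3.0, 1.0, 0.0], [0.0, 0.0, 0.0]], mask)
```

In this toy run the first attention row keeps a single entry while the second stays fully dense, which is a small instance of the hybrid dense-sparse workload the abstract describes.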

Problem

Research questions and friction points this paper is trying to address.

diffusion models

computational overhead

self-attention

high resolution

quadratic complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cached Token Reuse

Sparsity-aware Acceleration

Hash-based Distribution
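
The general motivation for hashing addresses across memory banks can be sketched as follows. This page does not describe the hash function DiSC uses, so the multiplicative hash, the function names, and the power-of-two bank count below are all assumptions chosen for illustration only. The sketch compares how many accesses land on the busiest bank when a regularly strided set of indices is mapped by plain modulo versus by a hash.

```python
from collections import Counter

def bank_modulo(index, num_banks):
    """Baseline mapping: low-order bits of the index select the bank."""
    return index % num_banks

def bank_hashed(index, num_banks):
    """Illustrative multiplicative hash (not the paper's function).

    Assumes num_banks is a power of two; the top bits of a 32-bit
    product select the bank.
    """
    bits = num_banks.bit_length() - 1
    h = (index * 2654435761) & 0xFFFFFFFF
    return h >> (32 - bits)

def max_bank_load(indices, num_banks, bank_fn):
    """Number of accesses hitting the most heavily loaded bank."""
    loads = Counter(bank_fn(i, num_banks) for i in indices)
    return max(loads.values())

# A stride equal to the bank count sends every access to one bank under modulo.
strided = list(range(0, 64, 8))
modulo_load = max_bank_load(strided, 8, bank_modulo)
hashed_load = max_bank_load(strided, 8, bank_hashed)
```

With the modulo mapping all eight strided accesses serialize on a single bank, whereas the hashed mapping spreads them over several banks. Whether this matches the access patterns DiSC actually encounters is not stated in the summary.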