DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution

📅 2026-05-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the substantial computational overhead of high-resolution Transformer-based diffusion models, which stems from the quadratic complexity of self-attention and the iterative generation process. The authors propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator that combines Cached Token Reuse (CTR), which eliminates redundant token computation, with Softmax Thresholding with Sparsity Mask Reuse (ST), which induces attention sparsity. DiSC further employs a hash-based distribution scheme over on-chip memory banks and a unified dataflow, so it can process the resulting hybrid dense-sparse workload on its existing compute engines without dedicated sparsity-handling hardware. Evaluated on DiT and PixArt-Sigma models, DiSC achieves 3.47–4.74× and 2.48–3.50× speedups over NVIDIA A100 and H100 GPUs, respectively, while reducing energy consumption by 46.4% to 68.1%.
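To make the bank-distribution idea concrete, here is a minimal sketch rather than the accelerator's actual scheme. The summary does not say which hash DiSC uses, so the multiplicative hash, the bank count `NUM_BANKS`, and all helper names below are assumptions chosen for illustration. The sketch compares how a strided sparse index pattern lands on memory banks under plain modulo addressing and under a hashed mapping.

```python
from collections import Counter

NUM_BANKS = 8  # hypothetical bank count, not taken from the paper


def modulo_bank(index: int, num_banks: int = NUM_BANKS) -> int:
    """Baseline mapping: consecutive indices go to consecutive banks."""
    return index % num_banks


def hashed_bank(index: int, num_banks: int = NUM_BANKS) -> int:
    """Illustrative multiplicative hash (an assumption, not DiSC's hash)."""
    h = (index * 2654435761) & 0xFFFFFFFF  # keep the low 32 bits
    return (h >> 16) % num_banks           # use higher bits to pick a bank


def max_bank_load(indices, bank_fn, num_banks: int = NUM_BANKS) -> int:
    """Largest number of accesses that fall on any single bank."""
    loads = Counter(bank_fn(i, num_banks) for i in indices)
    return max(loads.values())


# A strided sparse pattern whose stride equals the bank count.
strided = list(range(0, 512, 8))
worst_modulo = max_bank_load(strided, modulo_bank)
worst_hashed = max_bank_load(strided, hashed_bank)
```

With this stride, modulo addressing sends all 64 accesses to a single bank, while the hashed mapping places them on more than one bank. That is the kind of conflict reduction a hash-based distribution is meant to provide for irregular sparse accesses.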
📝 Abstract
Transformer-based diffusion models offer superior scalability and performance but suffer from high computational overhead due to their iterative nature and the quadratic complexity of self-attention at high resolutions. In this paper, we propose DiSC, a resolution-scalable, sparsity-aware hardware accelerator. At the software level, DiSC introduces two algorithms: Cached Token Reuse (CTR) and Softmax Thresholding with Sparsity Mask Reuse (ST). CTR translates spatial variations in the input latent difference across steps into token-level reuse decisions, eliminating redundant token computation. ST induces sparsity in attention operations by reusing a previously generated sparsity pattern, leveraging temporal similarity to bypass costly prediction overhead. Together, these algorithms provide resolution-scalable computational benefits and yield moderate sparsity and a hybrid dense-sparse workload. To exploit this workload efficiently, we design a specialized hardware architecture and unified dataflow. This architecture avoids dedicated sparsity-handling components; instead, a hash-based distribution over on-chip memory banks allows DiSC to reuse its existing compute engines for sparse operations, efficiently exploiting the induced sparsity with minimal hardware overhead. Evaluated on DiT and PixArt-Sigma, DiSC achieves 3.47-4.74x and 2.48-3.50x speedups over NVIDIA A100 and H100 GPUs, respectively, with energy savings ranging from 46.4% to 68.1%.
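The two software-level ideas in the abstract can be sketched in plain Python. This is a hedged illustration, not the authors' implementation. The abstract does not give the difference metric, the thresholds, or how masked attention is normalized, so the mean absolute difference, the parameters `threshold` and `tau`, the renormalization over kept entries, and every function name here are assumptions.

```python
import math


def ctr_reuse_mask(prev_latent, curr_latent, threshold):
    """Per-token reuse decision: True when the latent barely changed.

    Uses mean absolute difference per token (an assumed metric).
    """
    reuse = []
    for prev_tok, curr_tok in zip(prev_latent, curr_latent):
        diff = sum(abs(c - p) for p, c in zip(prev_tok, curr_tok)) / len(prev_tok)
        reuse.append(diff < threshold)
    return reuse


def run_block_with_cache(tokens, cache, reuse, block_fn):
    """Recompute only tokens not marked for reuse; others come from cache."""
    outputs = []
    for i, tok in enumerate(tokens):
        if not (reuse[i] and cache[i] is not None):
            cache[i] = block_fn(tok)  # compute and refresh the cache entry
        outputs.append(cache[i])
    return outputs


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_mask(scores, tau):
    """Keep attention entries whose softmax probability is at least tau.

    The row maximum is always kept so that no row ends up empty.
    """
    mask = []
    for row in scores:
        probs = softmax(row)
        top = max(probs)
        mask.append([p >= tau or p == top for p in probs])
    return mask


def masked_attention_weights(scores, mask):
    """Softmax over kept entries only, using a mask from an earlier step."""
    weights = []
    for row, keep in zip(scores, mask):
        peak = max(s for s, k in zip(row, keep) if k)
        exps = [math.exp(s - peak) if k else 0.0 for s, k in zip(row, keep)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

In this sketch a mask is generated once with `threshold_mask` and then passed to `masked_attention_weights` on later steps, so no new sparsity prediction is needed while the attention pattern stays similar over time. Likewise, `run_block_with_cache` calls `block_fn` only for tokens whose latent changed by more than the threshold.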
Problem

Research questions and friction points this paper is trying to address.

diffusion models
computational overhead
self-attention
high resolution
quadratic complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cached Token Reuse
Sparsity-aware Acceleration
Hash-based Distribution
Resolution-Scalable
Diffusion Models