🤖 AI Summary
To address the high computational cost and low inference efficiency of text-to-image diffusion models, this paper proposes a noise-relative-magnitude-aware token-level pruning and caching co-acceleration method. The authors design a cluster-aware token pruning mechanism that combines K-means spatial clustering with token importance estimation to dynamically retain visually salient tokens exhibiting both local consistency and semantic criticality. They further introduce distribution-balanced sampling and cross-step token caching to improve computational efficiency. While preserving generation quality (measured by FID, CLIP Score, and other standard metrics), the method reduces inference FLOPs by 50-60%, significantly improving throughput and energy efficiency. Unlike conventional uniform or global pruning approaches, it enables adaptive, semantics-guided token sparsification and intelligent reuse, establishing a new paradigm for efficient diffusion-model inference.
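The pruning idea described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the importance score, the minimal K-means routine, and the per-cluster quota (`keep_ratio`) are all assumptions standing in for the paper's actual criteria.

```python
# Hypothetical sketch: score tokens by the relative magnitude of their noise
# change between denoising steps, cluster them spatially with K-means, and keep
# the top-scoring tokens per cluster (a distribution-balanced quota).
import numpy as np

def kmeans(points, k, iters=10, seed=0):
    """Minimal K-means over token spatial coordinates; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def select_tokens(noise_t, noise_prev, coords, keep_ratio=0.4, k=8):
    """Return indices of tokens to keep, balanced across spatial clusters."""
    # Importance: magnitude of the noise change for each token between steps.
    importance = np.linalg.norm(noise_t - noise_prev, axis=-1)
    labels = kmeans(coords, k)
    keep = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        n_keep = max(1, int(keep_ratio * len(idx)))  # per-cluster quota
        keep.extend(idx[np.argsort(-importance[idx])[:n_keep]])
    return np.sort(np.array(keep))
```

The per-cluster quota is what keeps the retained set spatially spread out: a purely global top-k on importance could concentrate all kept tokens in one image region.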
📝 Abstract
Diffusion models have revolutionized generative tasks, especially text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. Using noise relative magnitude, we identify significant token changes across denoising iterations. We further enhance token selection by incorporating spatial clustering and enforcing distributional balance. Our experiments reveal a 50%-60% reduction in computational costs while preserving model performance, markedly increasing the efficiency of diffusion models. The code is available at https://github.com/ada-cheng/CAT-Pruning
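The caching side of the strategy can be illustrated with a small sketch. The function below is hypothetical (its name, signature, and the `compute_fn` callback are illustrative, not the repository's API); it shows the general pattern of recomputing only the retained tokens and reusing the previous step's outputs for the pruned ones.

```python
# Hypothetical sketch of cross-step token caching: outputs for pruned tokens
# are reused from the previous denoising step instead of being recomputed.
import numpy as np

def denoise_step_with_cache(tokens, keep_idx, cache, compute_fn):
    """Run compute_fn only on kept tokens; fill pruned tokens from the cache."""
    out = cache.copy() if cache is not None else np.zeros_like(tokens)
    out[keep_idx] = compute_fn(tokens[keep_idx])  # fresh compute for salient tokens
    return out  # this also serves as the cache for the next step
```

Because only `len(keep_idx)` tokens pass through `compute_fn`, the per-step FLOPs scale with the keep ratio, which is where the reported 50%-60% savings would come from under this pattern.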