Beyond 2:4: Exploring V:N:M Sparsity for Efficient Transformer Inference on GPUs

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 2:4 sparsity patterns on GPUs yield limited acceleration (≤1.3×) in Transformer inference, suffer from fixed sparsity ratios, and are difficult to scale to higher sparsity levels (e.g., 4:8, 8:16, or >50%). Method: This work systematically explores V:N:M sparsity patterns, introducing three key techniques: (i) sparse-channel-aware weight reordering, (ii) a three-stage LoRA fine-tuning strategy, and (iii) GPU-sparse-tensor-core-compatible optimizations. Contribution/Results: We demonstrate, for the first time, the broad efficacy of V:N:M sparsity across ViTs and LLMs. A heuristic V/M selection strategy enables high sparsity (e.g., 64:2:5 or 64:2:8) without accuracy loss. Experiments show zero accuracy degradation on DeiT-small (64:2:5) and DeiT-base (64:2:8); Llama2-7B (64:2:5) outperforms standard 2:4 sparsity on downstream tasks, achieving superior speed–accuracy trade-offs.

📝 Abstract
To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often yields low actual speedups ($\leq 1.3\times$) and requires a fixed sparse ratio, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers: heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training. Experimental results show that, with our methods, DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned Llama2-7B at 64:2:5 sparsity performs comparably to or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs than 2:4 sparsity. Overall, our exploration positions V:N:M sparsity as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
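The V:N:M pattern discussed in the abstract can be illustrated with a magnitude-based pruning sketch: the weight matrix is tiled into V-row by M-column blocks, the 4 strongest columns in each block are kept as candidates (so the result maps onto the GPU's 2:4 sparse tensor cores), and each row then keeps its N largest entries among those candidates. This is a simplified illustration under that assumed block layout, not the authors' implementation, and the function name is ours.

```python
import numpy as np

def vnm_prune(W, V=64, N=2, M=5):
    """Simplified magnitude-based V:N:M pruning sketch.

    Tiles W into V x M blocks. In each block, the 4 columns with the
    largest L1 magnitude are kept as candidates, and every row keeps
    only its N largest entries among those 4 candidates, giving an
    overall density of N/M (e.g., 40% for 64:2:5).
    """
    rows, cols = W.shape
    assert rows % V == 0 and cols % M == 0
    mask = np.zeros_like(W, dtype=bool)
    for r in range(0, rows, V):
        for c in range(0, cols, M):
            block = W[r:r + V, c:c + M]
            # Pick the 4 strongest columns of the block by L1 norm.
            col_scores = np.abs(block).sum(axis=0)
            keep_cols = np.argsort(col_scores)[-4:]
            sub = block[:, keep_cols]
            # N:4 (here 2:4) sparsity row-wise within those columns.
            order = np.argsort(np.abs(sub), axis=1)
            keep = order[:, -N:]
            for i in range(V):
                for j in keep[i]:
                    mask[r + i, c + keep_cols[j]] = True
    return W * mask
```

For a 64x10 Gaussian weight matrix pruned at 64:2:5, exactly 2/5 of the entries survive, matching the N/M density the paper exploits for speedups beyond 50% sparsity.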
Problem

Research questions and friction points this paper is trying to address.

2:4 sparsity yields limited real speedups (≤1.3×) and locks GPU sparse tensor cores to a fixed 50% ratio.
Other ratios (e.g., 4:8, 8:16, or >50% sparsity) gain no GPU acceleration under the 2:4 pattern.
The accuracy impact of V:N:M sparsity on vision Transformers and LLMs, and how to select V and M, remain unexamined.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematic application of V:N:M sparsity to ViTs and LLMs on GPUs
Heuristic V and M selection
V:N:M-specific channel permutation
Three-staged LoRA training
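The three-staged LoRA training listed above fine-tunes the pruned model while leaving the sparse base weights untouched, so the V:N:M mask is preserved. A minimal sketch of a single LoRA adapter follows; it shows only the frozen-weight-plus-low-rank-update structure, not the paper's three-stage schedule, and the class and parameter names are illustrative.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA adapter sketch (illustrative, not the paper's code).

    The frozen weight W (here, already V:N:M-pruned) is never updated;
    only the low-rank factors A and B are trained, so the sparsity mask
    on W survives fine-tuning.
    """

    def __init__(self, W, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = W.shape
        self.W = W                                            # frozen base weight
        self.A = rng.standard_normal((rank, in_dim)) * 0.01   # trainable
        self.B = np.zeros((out_dim, rank))                    # zero-init: no update at start
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is zero-initialized, the adapted layer is exactly the sparse base layer at the start of fine-tuning; training then moves only A and B.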