Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse Transformers suffer from suboptimal GPU computational efficiency due to dynamic sequence lengths, heterogeneous attention mask structures, and limited operator fusion. To address this, we propose STOF—a unified framework enabling mask-aware cross-operator fusion and a single, flexible kernel for multi-head attention (MHA). STOF supports arbitrary sparsity patterns and dynamic sequence lengths while unifying computation across sparse attention variants. Its core innovation is a two-stage compilation-time template search engine that jointly applies heuristic and exhaustive search to automatically optimize fusion strategies under arbitrary mask constraints, overcoming the expressiveness limitations of rule-based fusion approaches. Experiments demonstrate up to 1.7× speedup in MHA computation and 1.5× end-to-end inference acceleration, significantly outperforming state-of-the-art sparse Transformer implementations.
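To make the "single, flexible kernel for multi-head attention" idea concrete, the NumPy sketch below shows how several sparse attention variants (causal, sliding-window, block-diagonal) reduce to one masked-attention routine once the mask is made explicit. It is an illustrative reference for the computation's semantics under assumed shapes, not STOF's fused CUDA kernel.

```python
# Minimal NumPy sketch (not the STOF kernel): different sparse attention
# variants share one masked-attention routine once the mask is explicit.
import numpy as np

def masked_attention(q, k, v, mask):
    """q, k, v: (seq, d); mask: (seq, seq) boolean, True = keep."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # disable masked positions
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

seq = 8
idx = np.arange(seq)
causal = idx[:, None] >= idx[None, :]                # lower-triangular mask
window = np.abs(idx[:, None] - idx[None, :]) <= 2    # sliding-window mask
blocks = (idx[:, None] // 4) == (idx[None, :] // 4)  # block-diagonal mask

q = k = v = np.random.rand(seq, 4)
for m in (causal, window, blocks):
    _ = masked_attention(q, k, v, m)                 # one routine, many masks
```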

📝 Abstract
Large language models (LLMs) are popular worldwide due to their powerful understanding capabilities. Since the Transformer is the core component of LLMs, accelerating it through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into the Transformer to reduce computation. However, previous works rarely focus on performance optimization for sparse Transformers. Moreover, rule-based mechanisms ignore fusion opportunities among mixed-type operators and fail to adapt to varying sequence lengths. To address these problems, we propose STOF, a framework that optimizes the sparse Transformer via flexible masking and operator fusion on GPU. We first unify the storage format and kernel implementation for multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that, compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7× in MHA computation and 1.5× in end-to-end inference.
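The two-stage search described in the abstract can be pictured as heuristic pruning of fusion-scheme candidates followed by exhaustive tuning of the survivors' parameters. The sketch below encodes only that control flow; `fusion_templates`, `heuristic_score`, `benchmark`, `tile_m`, and `warps` are hypothetical placeholders, not STOF's actual API.

```python
# Hedged sketch of a two-stage template search: heuristic pruning, then
# exhaustive tuning. All names and numbers here are illustrative assumptions.
from itertools import product

fusion_templates = ["fuse_qkv", "fuse_softmax_mask", "no_fusion"]  # candidate schemes
param_space = {"tile_m": [32, 64, 128], "warps": [2, 4, 8]}

def heuristic_score(template, seq_len):
    # Stage 1: cheap analytical estimate used only to rank and prune schemes.
    return {"fuse_qkv": 3, "fuse_softmax_mask": 2, "no_fusion": 1}[template]

def benchmark(template, params, seq_len):
    # Stage 2 stand-in: a real system would compile and time the generated kernel.
    return params["tile_m"] / params["warps"] - heuristic_score(template, seq_len)

def two_stage_search(seq_len, keep=2):
    # Stage 1: keep only the most promising fusion schemes.
    survivors = sorted(fusion_templates,
                       key=lambda t: heuristic_score(t, seq_len), reverse=True)[:keep]
    # Stage 2: exhaustively evaluate every parameter setting for the survivors.
    candidates = ((t, dict(zip(param_space, vals)))
                  for t in survivors
                  for vals in product(*param_space.values()))
    return min(candidates, key=lambda c: benchmark(c[0], c[1], seq_len))

print(two_stage_search(seq_len=512))
```

In a real system, the stage-2 objective would measure the generated kernel on the target GPU rather than compute a synthetic score.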
Problem

Research questions and friction points this paper is trying to address.

Optimizing sparse Transformer performance on GPU
Enhancing operator fusion for diverse masking types
Adapting to variable sequence lengths efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified storage format for multi-head attention (see the sketch after this list)
Flexible operator fusion via compilation templates
Two-stage search engine for optimal parameters
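As an illustration of the first item above, one plausible "unified storage format" is a CSR-like layout over fixed-size mask blocks, so that causal, sliding-window, and block-diagonal masks all share one structure. The field names and block size below are assumptions made for the sketch, not the format defined in the paper.

```python
# Illustrative only: a CSR-like block layout that can hold diverse attention masks.
from dataclasses import dataclass
import numpy as np

@dataclass
class BlockSparseMask:
    block: int              # block edge length
    row_ptr: np.ndarray     # (num_block_rows + 1,) offsets into col_idx
    col_idx: np.ndarray     # block-column index of every non-empty block

def from_dense(mask: np.ndarray, block: int = 4) -> BlockSparseMask:
    """Pack any boolean (seq, seq) mask into the unified block layout."""
    nb = mask.shape[0] // block
    row_ptr, col_idx = [0], []
    for i in range(nb):
        for j in range(nb):
            if mask[i*block:(i+1)*block, j*block:(j+1)*block].any():
                col_idx.append(j)
        row_ptr.append(len(col_idx))
    return BlockSparseMask(block, np.array(row_ptr), np.array(col_idx))

# Causal and sliding-window masks both map onto the same structure.
idx = np.arange(16)
causal = idx[:, None] >= idx[None, :]
window = np.abs(idx[:, None] - idx[None, :]) <= 3
print(from_dense(causal).col_idx, from_dense(window).col_idx)
```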
Authors

Wenhao Dai, China University of Petroleum-Beijing, Beijing, China
Haodong Deng, China University of Petroleum-Beijing, Beijing, China
Mengfei Rong, China University of Petroleum-Beijing, Beijing, China
Xinyu Yang, Beihang University, Beijing, China
Hongyu Liu, HKUST (Computer Vision)
Fangxin Liu, Shanghai Jiao Tong University (In-memory Computing, Brain-inspired Neuromorphic Computing)
Hailong Yang, Beihang University, Beijing, China
Weifeng Liu, University of Florida (Machine Learning, Signal Processing, Kernel Adaptive Filtering)
Qingxiao Sun, China University of Petroleum, Beijing (GPU Architecture, HPC, Deep Learning)