SEVEN: Pruning Transformer Model by Reserving Sentinels

📅 2024-03-19
🏛️ IEEE International Joint Conference on Neural Networks (IJCNN)
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
Gradient noise in Transformer pruning causes inaccurate weight-importance estimation, leading to poor sparse-model performance and heightened sensitivity to sparsity. Method: the paper proposes a dynamic sensitivity-assessment mechanism based on symbolic-descent accumulation, introducing "sentinel weights" to identify low-noise, high-consistency sensitive parameters. It integrates dynamic importance scoring with both structured and unstructured pruning, and supports adaptation to multiple fine-tuning strategies. Contributions/Results: extensive experiments on natural language, question-answering, and image-classification tasks show that SEVEN consistently outperforms state-of-the-art pruning approaches at 50%–90% sparsity, achieving higher accuracy and stronger robustness across diverse fine-tuning strategies.

📝 Abstract
Large-scale Transformer models (TM) have demonstrated outstanding performance across various tasks. However, their considerable parameter size restricts their applicability, particularly on mobile devices. Because gradients on TM are more dynamic and intricate than on Convolutional Neural Networks, commonly used pruning methods tend to retain weights with larger gradient noise. This results in pruned models that are sensitive to sparsity and datasets and exhibit suboptimal performance. Symbolic Descent (SD) is a general approach for training and fine-tuning TM. In this paper, we describe the noisy batch gradient sequences on TM through the cumulative process of SD and use this design to dynamically assess the importance scores of weights. We introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise; these weights tend to be preserved by SEVEN. Extensive experiments on various TM in the natural language, question-answering, and image classification domains validate the effectiveness of SEVEN. The results demonstrate significant improvements from SEVEN in multiple pruning scenarios and across different sparsity levels. Additionally, SEVEN exhibits robust performance under various fine-tuning strategies. The code is publicly available at https://github.com/xiaojinying/SEVEN.
Problem

Research questions and friction points this paper is trying to address.

Pruning large Transformer models for mobile deployment
Reducing sensitivity to sparsity and dataset variations
Identifying weights with consistent high importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pruning via symbolic descent gradients
Dynamic importance scoring for weights
Preserving low-noise high-sensitivity weights
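
The idea behind these contributions can be sketched as a small importance score: an exponential moving average of the gradient sign rewards weights whose batch gradients consistently agree in direction (low noise), while noisy, alternating gradients cancel out; the score is then combined with weight magnitude and thresholded into a pruning mask. This is an illustrative sketch, not SEVEN's actual formula — the function names (`sentinel_scores`, `prune_mask`), the EMA form, and the magnitude weighting are assumptions.

```python
import numpy as np

def sentinel_scores(weights, grad_batches, beta=0.9):
    """Accumulate a sign-descent-style importance score per weight.

    Weights whose batch gradients keep a consistent sign accumulate a
    large score; alternating (noisy) gradient signs cancel out.
    Hypothetical form, not the paper's exact criterion.
    """
    score = np.zeros_like(weights)
    for g in grad_batches:
        # EMA of the signed gradient: consistent signs reinforce,
        # alternating signs cancel — a proxy for gradient noise.
        score = beta * score + (1 - beta) * np.sign(g)
    # Combine sign-consistency with weight magnitude.
    return np.abs(weights) * np.abs(score)

def prune_mask(scores, sparsity):
    """Zero out the lowest-scoring `sparsity` fraction of weights."""
    k = int(round(sparsity * scores.size))
    if k == 0:
        return np.ones_like(scores, dtype=bool)
    threshold = np.partition(scores.ravel(), k - 1)[k - 1]
    return scores > threshold

# Two weights of equal magnitude: one with a stable gradient sign,
# one whose gradient sign flips every batch.
w = np.array([1.0, 1.0])
grads = [np.array([0.5, 0.5 * (-1) ** t]) for t in range(10)]
s = sentinel_scores(w, grads)
mask = prune_mask(s, sparsity=0.5)  # keeps the low-noise weight
```

Under this scoring, magnitude alone cannot separate the two weights; the accumulated sign consistency is what keeps the low-noise weight and prunes the noisy one, mirroring the "preserve low-noise, high-sensitivity weights" idea above.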
Jinying Xiao
School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
Ping Li
School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
Jie Nie
School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
Zhe Tang
University of Liverpool
WSN, IoT, hybrid network