Lightweight and Post-Training Structured Pruning for On-Device Large Language Models

📅 2025-01-25
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenges of high memory overhead, severe accuracy degradation after pruning, and reliance on fine-tuning and labeled data when deploying large language models (LLMs) on edge devices, this paper proposes a lightweight structured pruning method that requires no fine-tuning and preserves the original model architecture. The approach features three key contributions: (1) a novel hybrid-granularity pruning strategy, coarse-grained across layers and fine-grained at the neuron level within layers; (2) an unsupervised neuron importance scoring mechanism based on the matrix condition number, eliminating the need for ground-truth labels; and (3) mask tuning to recover accuracy without any training data. Evaluated on LLaMA-2-7B, the method achieves a 6.13% accuracy improvement over LLM-Pruner at a 20% pruning ratio, reduces memory footprint by 80%, and enables plug-and-play deployment on resource-constrained edge devices.

๐Ÿ“ Abstract
Owing to its hardware-friendly characteristics and broad applicability, structured pruning has emerged as an efficient solution for reducing the resource demands of large language models (LLMs) on resource-constrained devices. Traditional structured pruning methods often need fine-tuning to recover performance loss, which incurs high memory overhead and substantial data requirements, rendering them unsuitable for on-device applications. Additionally, post-training structured pruning techniques typically necessitate specific activation functions or architectural modifications, thereby limiting their scope of application. Herein, we introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP initially prunes selected model layers based on their importance at a coarse granularity, followed by fine-grained neuron pruning within the dense layers of each remaining model layer. To more accurately evaluate neuron importance, COMP introduces a new metric based on the matrix condition number. Subsequently, COMP utilizes mask tuning to recover accuracy without the need for fine-tuning, significantly reducing memory consumption. Experimental results demonstrate that COMP improves performance by 6.13% on the LLaMA-2-7B model with a 20% pruning ratio compared to LLM-Pruner, while simultaneously reducing memory overhead by 80%.
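The abstract describes scoring neurons with a metric based on the matrix condition number. The paper's exact formulation is not reproduced on this page, but a minimal, hypothetical sketch of one such score (how much deleting a neuron's weight row perturbs the layer's condition number) might look like this; the function name and the perturbation-based scoring rule are illustrative assumptions, not the authors' method:

```python
import numpy as np

def neuron_importance_by_condition(W):
    """Hypothetical sketch: score each output neuron (row of W) by how
    much removing its weight row shifts the condition number of the
    dense layer's weight matrix. Not the authors' exact COMP metric."""
    base_cond = np.linalg.cond(W)  # ratio of largest to smallest singular value
    scores = np.empty(W.shape[0])
    for i in range(W.shape[0]):
        W_without_i = np.delete(W, i, axis=0)  # drop neuron i's row
        scores[i] = abs(np.linalg.cond(W_without_i) - base_cond)
    return scores

# Toy layer: 8 output neurons, 16 inputs; prune the 20% least important.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
scores = neuron_importance_by_condition(W)
keep = np.argsort(scores)[: int(0.8 * len(scores))]  # indices of neurons kept
```

Because such a score needs only the weights (and optionally calibration activations), it requires no labels, which matches the unsupervised setting the abstract claims.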
Problem

Research questions and friction points this paper is trying to address.

Model Pruning
Resource Efficiency
Performance Preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

COMP method
hybrid pruning strategy
neuron evaluation
Zihuai Xu
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China
Yang Xu
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China
Hongli Xu
University of Science and Technology of China
Software Defined Network · Cooperative Communication · Sensor Networks
Yunming Liao
University of Science and Technology of China
Edge Intelligence · Edge Computing · Federated Learning · Split Federated Learning
Zhiwei Yao
University of Science and Technology of China
Edge Computing · Federated Learning
Zuan Xie
School of Computer Science and Technology, University of Science and Technology of China; Suzhou Institute for Advanced Research, University of Science and Technology of China