AI Summary
To address the challenges of high memory overhead, severe accuracy degradation after pruning, and reliance on fine-tuning and labeled data when deploying large language models (LLMs) on edge devices, this paper proposes a lightweight, structured pruning method that requires no fine-tuning and preserves the original model architecture. Our approach features three key contributions: (1) a novel hybrid-granularity pruning strategy: coarse-grained across layers and fine-grained at the neuron level within layers; (2) an unsupervised neuron importance scoring mechanism based on the matrix condition number, eliminating the need for ground-truth labels; and (3) mask tuning to recover accuracy without any training data. Evaluated on LLaMA-2-7B, our method achieves a 6.13% accuracy improvement over LLM-Pruner at a 20% pruning ratio, reduces memory footprint by 80%, and enables plug-and-play deployment on resource-constrained edge devices.
Abstract
Owing to its hardware-friendly characteristics and broad applicability, structured pruning has emerged as an efficient solution for reducing the resource demands of large language models (LLMs) on resource-constrained devices. Traditional structured pruning methods often require fine-tuning to recover the performance loss, which incurs high memory overhead and substantial data requirements, rendering them unsuitable for on-device applications. Additionally, post-training structured pruning techniques typically require specific activation functions or architectural modifications, which limits their applicability. Herein, we introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP first prunes selected model layers at a coarse granularity based on their importance, and then performs fine-grained neuron pruning within the dense layers of each remaining model layer. To evaluate neuron importance more accurately, COMP introduces a new metric based on the matrix condition number. Subsequently, COMP uses mask tuning to recover accuracy without fine-tuning, significantly reducing memory consumption. Experimental results demonstrate that COMP improves performance by 6.13% on LLaMA-2-7B at a 20% pruning ratio compared to LLM-Pruner, while reducing memory overhead by 80%.
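The abstract describes scoring neuron importance via the matrix condition number and then pruning the least important neurons. The paper's exact COMP formula is not reproduced here; the sketch below is only a hypothetical illustration of the general idea, scoring each output neuron (a row of a weight matrix) by how much its removal perturbs the matrix's condition number, then dropping the lowest-scoring rows (structured pruning). All function names and the scoring rule are assumptions for illustration.

```python
import numpy as np

def neuron_importance(W: np.ndarray) -> np.ndarray:
    """Hypothetical condition-number-based score (not the paper's exact metric).

    Each output neuron corresponds to a row of W; a neuron whose removal
    changes the conditioning of W a lot is treated as important.
    """
    base = np.linalg.cond(W)  # 2-norm condition number via SVD (works for non-square W)
    scores = np.empty(W.shape[0])
    for i in range(W.shape[0]):
        reduced = np.delete(W, i, axis=0)  # drop row i (prune neuron i)
        scores[i] = abs(np.linalg.cond(reduced) - base)
    return scores

def prune_rows(W: np.ndarray, ratio: float = 0.2) -> np.ndarray:
    """Remove the `ratio` fraction of least-important rows, keeping row order."""
    scores = neuron_importance(W)
    k = int(W.shape[0] * ratio)                 # number of rows to prune
    keep = np.sort(np.argsort(scores)[k:])      # indices of surviving neurons
    return W[keep]

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))   # toy dense layer: 16 neurons, 32 inputs
pruned = prune_rows(W, ratio=0.25)  # shape becomes (12, 32)
```

Note that this label-free scoring uses only the weights themselves, which matches the abstract's claim that no ground-truth labels or training data are needed; the subsequent mask-tuning step for accuracy recovery is not sketched here.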