GradPruner: Gradient-Guided Layer Pruning Enabling Efficient Fine-Tuning and Inference for LLMs

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes GradPruner, a novel structured pruning method that addresses the inefficiencies in training and inference commonly encountered during fine-tuning of large language models. Unlike existing approaches that incur additional overhead, GradPruner leverages accumulated parameter gradients in the early stages of fine-tuning to construct an Initial Gradient Information Accumulation Matrix (IGIA-Matrix), which is used to identify and prune redundant layers. Pruned layers are then merged with retained ones based on sign consistency, eliminating the need for extra training or search procedures. Evaluated on two prominent large language models across eight downstream tasks, GradPruner achieves a 40% reduction in model parameters with only a 0.99% average accuracy drop, substantially improving both training and inference efficiency.

📝 Abstract
Fine-tuning Large Language Models (LLMs) on downstream data is often time-consuming and expensive. Structured pruning methods are primarily employed to improve the inference efficiency of pre-trained models; however, they often require additional time and memory for training, knowledge distillation, structure search, and other strategies, making efficient fine-tuning difficult to achieve. To simultaneously improve the training and inference efficiency of downstream fine-tuning, we introduce GradPruner, which prunes layers of LLMs guided by gradients in the early stages of fine-tuning. GradPruner uses the cumulative gradient of each parameter during the initial phase of fine-tuning to compute the Initial Gradient Information Accumulation Matrix (IGIA-Matrix), which assesses layer importance and guides pruning. We sparsify the pruned layers based on the IGIA-Matrix and merge them with the remaining layers, merging only elements with the same sign to reduce interference from sign variations. We conducted extensive experiments on two LLMs across eight downstream datasets, including medical, financial, and general benchmark tasks. The results demonstrate that GradPruner achieves a parameter reduction of 40% with only a 0.99% decrease in accuracy. Our code is publicly available.
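The two mechanisms the abstract describes, accumulating early-stage gradient information to score layers and merging pruned weights into retained layers only where signs agree, could be sketched roughly as follows. This is a hypothetical NumPy illustration: the function names, the magnitude-based accumulation, and the mean-based layer score are assumptions for exposition, not the paper's exact IGIA-Matrix formulation.

```python
import numpy as np

def igia_scores(grad_history):
    """Sketch of IGIA-style layer scoring (assumed formulation).

    grad_history: list over early fine-tuning steps; each entry is a list of
    per-layer gradient arrays. Gradient magnitudes are accumulated per
    parameter, then averaged per layer to give an importance score.
    """
    n_layers = len(grad_history[0])
    igia = [np.zeros_like(grad_history[0][l]) for l in range(n_layers)]
    for step_grads in grad_history:
        for l, g in enumerate(step_grads):
            igia[l] += np.abs(g)  # accumulate gradient information
    # a layer with low accumulated gradient information is a pruning candidate
    return np.array([m.mean() for m in igia])

def sign_consistent_merge(kept, pruned):
    """Merge a pruned layer's weights into a retained layer only where the
    two tensors agree in sign, leaving sign-conflicting entries untouched."""
    same_sign = np.sign(kept) == np.sign(pruned)
    return np.where(same_sign, kept + pruned, kept)
```

Under this sketch, layers whose score falls below a chosen threshold would be dropped, and their (sparsified) weights folded into neighboring retained layers via `sign_consistent_merge`, with no extra training or search loop.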
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Efficient Fine-tuning
Layer Pruning
Inference Efficiency
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-Guided Pruning
Layer Pruning
Efficient Fine-Tuning
LLM Compression
IGIA-Matrix
Wei Huang
Google, Inc
Program Analysis · Type Inference · Web/Mobile Security
Anda Cheng
Ant Group, Beijing, China
Yinggui Wang
Ant Group, Beijing, China