$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of deploying large language models, a challenge exacerbated by existing pruning methods that overlook activation distribution shifts between calibration and test data and fail to account for the long-tailed activation patterns in attention mechanisms. To overcome these limitations, the authors propose $D^2Prune$, a novel pruning framework that jointly models weight and activation perturbations via dual Taylor expansions to estimate pruning error accurately. It further introduces an attention-aware dynamic update strategy that optimizes a combined objective of KL divergence and reconstruction error, effectively preserving critical long-tailed attention patterns. Extensive experiments show that $D^2Prune$ consistently outperforms state-of-the-art pruning techniques across diverse LLMs (OPT-125M, LLaMA2/3, and Qwen3), and that the dynamic attention update also generalizes to ViT-based vision models such as DeiT, achieving higher accuracy on ImageNet-1K.
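The attention-aware objective described above combines a KL-divergence term over attention distributions with a reconstruction-error term. The summary does not give the exact formulation, so the weighting `lam` and the precise distance terms below are assumptions; this is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_aware_loss(scores_dense, scores_pruned, out_dense, out_pruned, lam=1.0):
    """Hypothetical combined objective: KL divergence between the dense
    and pruned models' attention distributions, plus the squared
    reconstruction error of the layer outputs. The exact form used by
    D^2Prune is not specified in this summary."""
    p = softmax(scores_dense)    # attention distribution before pruning
    q = softmax(scores_pruned)   # attention distribution after pruning
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)))
    recon = np.sum((out_dense - out_pruned) ** 2)
    return kl + lam * recon
```

Minimizing the KL term keeps the pruned model's attention distribution (including its long tail) close to the original, while the reconstruction term keeps layer outputs faithful.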

📝 Abstract
Large language models (LLMs) face significant deployment challenges due to their massive computational demands. While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) they neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) they overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
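The "dual" perturbation idea in the abstract can be made concrete for a linear layer $Y = XW$: pruning perturbs the weights ($\Delta W$) while the calibration-to-test shift perturbs the activations ($\Delta X$), and expanding $(X+\Delta X)(W+\Delta W) - XW$ yields three error terms. The sketch below is an illustrative reconstruction under that assumption (the paper's actual expansion and saliency score are not given here), where `saliency` is a hypothetical per-weight score:

```python
import numpy as np

def dual_perturbation_error(X, W, dX, dW):
    """Output error of Y = X @ W under joint perturbation of weights (dW,
    e.g. zeroing entries during pruning) and activations (dX, e.g. a shift
    between calibration and test data). Expanding (X+dX)(W+dW) - X W gives
    three terms; methods that ignore activation shift keep only X @ dW."""
    return X @ dW + dX @ W + dX @ dW

def saliency(X, W, dX, i, j):
    """Hypothetical saliency of weight W[i, j]: the squared output error
    incurred by pruning it, accounting for the activation shift dX."""
    dW = np.zeros_like(W)
    dW[i, j] = -W[i, j]  # pruning sets this weight to zero
    return float(np.sum(dual_perturbation_error(X, W, dX, dW) ** 2))
```

For a purely linear layer the three-term expansion is exact; a Taylor expansion becomes an approximation once nonlinearities (attention, activations) are involved, which is where the second-order terms the paper models would matter.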
Problem

Research questions and friction points this paper is trying to address.

pruning
activation distribution shift
long-tail distribution
attention mechanism
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Taylor Expansion
Attention Distribution Awareness
Model Pruning
Long-tail Activation
KL Divergence Minimization