🤖 AI Summary
Transformer inference suffers from low efficiency, and existing gradient-based Head Importance Score (HIS) pruning methods neglect the diversity of attention patterns, leading to unstable pruning. To address this, we propose a unified pruning criterion that jointly incorporates attention entropy and HIS: entropy quantifies the diversity of each head's attention distributions, while HIS captures the task-specific, gradient-driven contribution of each head. By integrating these complementary signals, our method enables a more comprehensive and robust assessment of head importance. This work is the first to introduce an information-theoretic perspective, specifically attention entropy, into attention head pruning. Extensive experiments on multiple NLP benchmarks demonstrate that our approach achieves up to a 15.2% improvement in post-pruning model quality and a 2.04× improvement in pruning stability, all without sacrificing accuracy. The proposed framework establishes a new paradigm for efficient and reliable Transformer compression.
📝 Abstract
Transformer-based models have achieved remarkable performance on NLP tasks. However, their structural characteristics, namely multiple layers and attention heads, introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for their interpretability, efficiency, and ability to identify redundant heads. However, HIS alone is limited: it captures only the gradient-driven contribution of a head and overlooks the diversity of its attention patterns. To overcome this limitation, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to a 15.2% improvement in model quality and a 2.04× improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
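To make the idea concrete, here is a minimal sketch of how attention entropy per head could be computed and blended with HIS into a single score. The exact weighting and normalization used by HIES are not given in this summary, so the `alpha` mixing weight, the min-max normalization, and the averaging of entropy over query positions are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy of each head's attention distributions.

    attn: array of shape (num_heads, seq_len, seq_len), where each row
    (one query position's attention over keys) sums to 1.
    Assumption: per-head entropy is the average over query positions.
    """
    h = -np.sum(attn * np.log(attn + eps), axis=-1)  # (num_heads, seq_len)
    return h.mean(axis=-1)                           # (num_heads,)

def combined_score(his, entropy, alpha=0.5):
    """Hypothetical HIS + entropy combination (not the paper's formula).

    Both signals are min-max normalized so neither dominates, then
    blended with mixing weight alpha.
    """
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return alpha * norm(his) + (1.0 - alpha) * norm(entropy)

# Toy example: 4 heads attending over a length-3 sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 3))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

his = np.array([0.9, 0.1, 0.5, 0.3])   # e.g. gradient-based importance per head
scores = combined_score(his, attention_entropy(attn))
prune_order = np.argsort(scores)        # lowest-scoring heads pruned first
```

In this sketch, a head with both a low importance score and low attention-pattern diversity ends up early in `prune_order`, which mirrors the paper's intuition that the two signals provide complementary evidence about a head's contribution.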