🤖 AI Summary
Transformer inference suffers from low efficiency, and existing gradient-based Head Importance Score (HIS) pruning methods neglect the diversity of attention patterns, leading to unstable pruning. To address this, we propose a unified pruning criterion that jointly incorporates attention entropy and HIS: entropy quantifies the diversity of each head's attention distributions, while HIS captures the task-specific, gradient-driven contribution of each head. By integrating these complementary signals, our method enables a more comprehensive and robust assessment of head importance. This work is the first to introduce an information-theoretic perspective, specifically attention entropy, into attention head pruning. Extensive experiments on multiple NLP benchmarks demonstrate that our approach achieves up to a 15.2% improvement in post-pruning model quality and a 2.04× improvement in pruning stability, all without sacrificing accuracy. The proposed framework establishes a new paradigm for efficient and reliable Transformer compression.
📝 Abstract
Transformer-based models have achieved remarkable performance on NLP tasks. However, their structural characteristics, namely multiple layers and attention heads, introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for their interpretability, efficiency, and ability to identify redundant heads. However, HIS alone is limited: it captures only the gradient-driven contribution of a head and overlooks the diversity of its attention patterns. To overcome this limitation, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to a 15.2% improvement in model quality and a 2.04× improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
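To make the idea concrete, here is a minimal sketch of how attention entropy per head could be computed and blended with HIS into a single score. The exact weighting and normalization used by HIES are not given in this summary, so the `alpha` mixing weight, the min-max normalization, and the averaging of entropy over query positions are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy of each head's attention distributions.

    attn: array of shape (num_heads, seq_len, seq_len), where each row
    (one query position's attention over keys) sums to 1.
    Assumption: per-head entropy is the average over query positions.
    """
    h = -np.sum(attn * np.log(attn + eps), axis=-1)  # (num_heads, seq_len)
    return h.mean(axis=-1)                           # (num_heads,)

def combined_score(his, entropy, alpha=0.5):
    """Hypothetical HIS + entropy combination (not the paper's formula).

    Both signals are min-max normalized so neither dominates, then
    blended with mixing weight alpha.
    """
    def norm(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return alpha * norm(his) + (1.0 - alpha) * norm(entropy)

# Toy example: 4 heads attending over a length-3 sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3, 3))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

his = np.array([0.9, 0.1, 0.5, 0.3])   # e.g. gradient-based importance per head
scores = combined_score(his, attention_entropy(attn))
prune_order = np.argsort(scores)        # lowest-scoring heads pruned first
```

In this sketch, a head with both a low importance score and low attention-pattern diversity ends up early in `prune_order`, which mirrors the paper's intuition that the two signals provide complementary evidence about a head's contribution.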