Entropy-Based Block Pruning for Efficient Large Language Models

πŸ“… 2025-04-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the excessive computational and memory overhead of deploying large language models, this paper proposes a transformer module-level pruning method grounded in the entropy of hidden-layer representations. Unlike conventional redundancy criteria based on cosine similarity, the approach adopts entropy as the core pruning metric, dynamically identifying and removing redundant computation blocks by modeling information uncertainty in hidden statesβ€”and uncovering a systematic trend in which entropy falls in the early blocks and then rises across most subsequent blocks. Evaluated across multiple benchmark tasks, the method achieves an average 32% parameter reduction while retaining over 98.5% of the original accuracy, significantly outperforming cosine-similarity-based baselines. The principal contributions are: (i) establishing a theoretical link between hidden-layer entropy and module redundancy; and (ii) introducing an interpretable, generalizable structured pruning paradigm. This entropy-driven framework offers both principled insight into transformer internal dynamics and practical efficiency gains for model compression.

πŸ“ Abstract
As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.
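The abstract's core idea can be sketched in a few lines: estimate the entropy of each block's hidden states, then treat blocks that barely change the entropy of their input as redundant candidates. This is a minimal illustration, not the paper's implementation; the histogram-based entropy estimator and the function names (`hidden_state_entropy`, `rank_blocks_for_pruning`) are assumptions for the sketch.

```python
import numpy as np

def hidden_state_entropy(h, bins=64):
    # Simplified Shannon entropy estimate (in nats) over all activation
    # values of one block's output, via a histogram. NOTE: an assumed
    # estimator for illustration; the paper may use a different one.
    counts, _ = np.histogram(h, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def rank_blocks_for_pruning(block_outputs):
    # Score block i+1 by how much it changes the entropy of its input
    # (block i's output); small changes suggest redundancy.
    entropies = [hidden_state_entropy(h) for h in block_outputs]
    deltas = [abs(entropies[i + 1] - entropies[i])
              for i in range(len(entropies) - 1)]
    order = sorted(range(len(deltas)), key=lambda i: deltas[i])
    return [i + 1 for i in order]  # block indices, most redundant first
```

Under this criterion, pruning removes the top-ranked blocks until a parameter budget is met, and accuracy is re-checked on the benchmark tasks.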
Problem

Research questions and friction points this paper is trying to address.

Reduce computational demands of large language models
Identify redundancy in Transformer-based models using entropy
Prune models efficiently while maintaining performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-based pruning for Transformer models
Quantifies information richness via entropy
Outperforms cosine similarity in efficiency
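For contrast, the cosine-similarity baseline the paper compares against scores a block as redundant when its output stays geometrically close to its input. A minimal sketch of that criterion (the function name `cosine_redundancy` is an assumption):

```python
import numpy as np

def cosine_redundancy(h_in, h_out):
    # Baseline criterion: cosine similarity between a block's flattened
    # input and output; values near 1 mark the block as redundant.
    a, b = h_in.ravel(), h_out.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The paper's argument is that this captures only geometric alignment, whereas entropy directly measures information content, making it the more reliable redundancy signal.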