🤖 AI Summary
To address deployment bottlenecks in large language model (LLM) layer pruning—including substantial accuracy degradation, high training overhead, and limited speedup—this paper proposes an efficient, economical, and effective differentiable pruning framework. Methodologically, it introduces (1) a Gumbel-TopK differentiable masking mechanism for end-to-end learning of layer importance, and (2) an entropy-based adaptive knowledge distillation strategy that enhances task adaptation of the retained layers at minimal cost, requiring only 0.5B training tokens. Evaluated on Qwen3-32B, pruning 25% of the layers retains 96% accuracy on MATH-500 (vs. 96.8% for the original model) with a 1.33× inference speedup, significantly outperforming existing state-of-the-art methods. The framework achieves a strong balance among accuracy preservation, computational efficiency, and training economy, demonstrating practicality for LLM deployment.
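The Gumbel-TopK masking idea can be illustrated with a short sketch: perturb per-layer importance logits with Gumbel noise, keep the top-k layers as a hard binary mask, and pair it with a softmax relaxation that would carry gradients during training (straight-through style). This is a minimal illustration under assumed details; the paper's exact sampler and relaxation may differ.

```python
import numpy as np

def gumbel_topk_mask(logits, k, tau=1.0, rng=None):
    """Sample a k-hot layer-keep mask via Gumbel perturbation (sketch).

    Returns both a hard 0/1 mask (forward pass) and a softmax relaxation
    scaled to sum to k (backward pass in a straight-through estimator).
    Illustrative only -- not the paper's exact formulation.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-12, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))           # Gumbel(0, 1) noise
    scores = (logits + gumbel) / tau       # perturbed, temperature-scaled scores
    keep = np.argsort(scores)[-k:]         # indices of the k layers to keep
    hard = np.zeros_like(logits)
    hard[keep] = 1.0
    # Soft relaxation: softmax over scores, rescaled so the mask sums to k.
    e = np.exp(scores - scores.max())
    soft = k * e / e.sum()
    return hard, soft
```

In a full framework, the hard mask would gate layer outputs in the forward pass while gradients flow through the soft relaxation, letting layer-importance logits be learned end to end.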
📝 Abstract
With the growing size of large language models, layer pruning has attracted increasing attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose a task-**E**ffective, training-**E**conomical, and inference-**E**fficient layer pruning framework. It introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments across diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, when pruning 25% of the layers of Qwen3-32B, our framework achieves 96% accuracy on MATH-500, a mere 0.8-point drop from the original model (96.8%), outperforming the existing SOTA (95%), with a 1.33× inference speedup while consuming merely 0.5B tokens (0.5% of the post-training data volume).
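One plausible instantiation of the entropy-aware adaptive distillation described above is to weight each token's distillation term by the teacher's predictive entropy, so that confident teacher predictions contribute more to the loss. The weighting rule below (`exp(-entropy)`) and temperature are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def entropy_weighted_kd_loss(teacher_logits, student_logits, tau=2.0):
    """Per-token KL distillation weighted by teacher entropy (illustrative sketch).

    teacher_logits, student_logits: arrays of shape (tokens, vocab).
    Tokens where the teacher is uncertain (high entropy) are down-weighted.
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(teacher_logits / tau)      # teacher distribution per token
    q = softmax(student_logits / tau)      # student distribution per token
    # Per-token KL(p || q), with epsilon smoothing for numerical safety.
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    # Teacher entropy per token; lower entropy -> larger weight.
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    w = np.exp(-entropy)
    return float((w * kl).sum() / w.sum())
```

Because the weights are derived from the teacher alone, this adaptation adds no extra learnable parameters and negligible compute on top of standard distillation.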