E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address deployment bottlenecks in large language model (LLM) layer pruning, including substantial accuracy degradation, high training overhead, and limited speedup, this paper proposes an efficient, economical, and effective differentiable pruning framework, E$^3$-Pruner. Methodologically, it introduces (1) a Gumbel-TopK differentiable masking mechanism for end-to-end learning of layer importance, and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task adaptation of the retained layers at minimal cost, requiring only 0.5B training tokens. Evaluated on Qwen3-32B with 25% of layers pruned, the method retains 96% accuracy on MATH-500 (0.8 points below the original model's 96.8%) and delivers a 1.33× inference speedup, significantly outperforming existing state-of-the-art methods. The framework balances accuracy preservation, inference efficiency, and training economy, demonstrating strong practicality for LLM deployment.
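The Gumbel-TopK masking step described above can be sketched as follows. This is a minimal forward-pass illustration only: the uniform initial logits, the temperature `tau`, and the hard `argsort`-based selection are assumptions for clarity, not the paper's implementation, which additionally relaxes the mask so that gradients can flow into the layer-importance logits during training.

```python
import numpy as np

def gumbel_topk_mask(logits, k, tau=1.0, seed=None):
    """Sample a binary keep-mask over layers via Gumbel-TopK.

    Adds Gumbel(0, 1) noise to per-layer importance logits and keeps
    the k highest-scoring layers. In end-to-end training, a softmax
    relaxation with a straight-through estimator would make this step
    differentiable; here only the hard forward sample is shown.
    """
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-10, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    scores = (logits + gumbel) / tau      # perturbed importance scores
    mask = np.zeros_like(logits)
    mask[np.argsort(scores)[-k:]] = 1.0   # keep the top-k layers
    return mask

# Keep 6 of 8 layers (25% layer pruning), uniform initial importance.
mask = gumbel_topk_mask(np.zeros(8), k=6, seed=0)
print(int(mask.sum()))  # 6 layers retained
```

Resampling with different noise explores different layer subsets, while the learned logits gradually bias the sampler toward the most important layers.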

📝 Abstract
With the increasing size of large language models, layer pruning has gained increasing attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-\underline{E}ffective, training-\underline{E}conomical and inference-\underline{E}fficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, E$^3$-Pruner achieves 96% accuracy on MATH-500, a mere 0.8% drop from the original model (96.8%), when pruning 25% of the layers of Qwen3-32B, outperforming the existing SOTA (95%), with a 1.33$\times$ inference speedup while consuming merely 0.5B training tokens (0.5% of the post-training data volume).
Problem

Research questions and friction points this paper is trying to address.

Addressing performance degradation in large language model layer pruning
Reducing high training costs associated with model compression
Overcoming limited inference acceleration in existing pruning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable mask optimization using Gumbel-TopK sampler
Entropy-aware adaptive knowledge distillation strategy
Efficient precise pruning with minimal performance degradation
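The entropy-aware distillation strategy listed above can be illustrated with a minimal sketch. The specific weighting used here, one minus the teacher's normalized predictive entropy, is an assumption for illustration; the paper's exact adaptive scheme may differ. The idea is that distillation pressure is reduced on tokens where the teacher itself is uncertain.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_weighted_kd_loss(teacher_logits, student_logits, T=2.0):
    """Per-token KL distillation loss, down-weighted where the teacher
    distribution has high entropy (i.e., the teacher is uncertain)."""
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)
    ent = -(p * np.log(p + 1e-12)).sum(-1)
    w = 1.0 - ent / np.log(p.shape[-1])  # 1 = confident teacher, 0 = max entropy
    return float((w * kl).mean())

# Toy usage: 4 tokens over a vocabulary of 16 (hypothetical shapes).
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 16))             # teacher logits
s = t + 0.1 * rng.normal(size=t.shape)   # slightly perturbed student
print(entropy_weighted_kd_loss(t, s))
```

A matching student yields zero loss, and the weighting keeps confidently-predicted tokens dominant in the gradient, which is one plausible way such a strategy cuts the token budget needed for recovery training.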