🤖 AI Summary
Existing structured pruning methods for large language models (LLMs) suffer significant performance degradation in zero-shot settings and rely on supervised fine-tuning (SFT) or adapter-based recovery to restore accuracy.
Method: We propose an efficient and robust pruning framework comprising three key components: (i) a first-order saliency criterion derived from the neural tangent kernel and Adam dynamics to precisely identify redundant hidden units; (ii) a cross-layer, cross-module adaptive sparsity allocation mechanism to improve structural compression rationality; and (iii) a KL-divergence-guided calibration data selection strategy to enhance generalization.
Results: Evaluated on Llama3, Qwen, and T5, our method achieves state-of-the-art performance at equivalent sparsity levels: it maintains superior zero-shot accuracy without SFT or adapters while preserving strong reasoning and transfer capabilities, striking a better trade-off between compression ratio and model performance.
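To make the first component concrete, here is a minimal sketch of what a first-order, Adam-aware saliency score for structured pruning could look like. This is a hypothetical illustration, not the paper's actual criterion: it scores each hidden unit by a first-order Taylor estimate of the loss change, using an Adam-style preconditioned gradient (gradient divided by the root of the second-moment estimate) in place of the raw gradient; the NTK-derived derivation in the paper may differ in detail.

```python
import numpy as np

def adam_saliency(weights, grads, exp_avg_sq, eps=1e-8):
    """Score each hidden unit (a row of the weight matrix) by
    |w * g / sqrt(v)| summed over its weights: a first-order
    estimate of loss change under Adam-preconditioned updates.
    (Hypothetical sketch; the paper's criterion may differ.)"""
    precond_grad = grads / (np.sqrt(exp_avg_sq) + eps)  # Adam-style scaling
    return np.abs(weights * precond_grad).sum(axis=1)   # one score per unit

def prune_units(weights, scores, sparsity):
    """Zero out the lowest-saliency rows to hit a target sparsity ratio."""
    k = int(sparsity * len(scores))
    drop = np.argsort(scores)[:k]  # indices of the least salient units
    pruned = weights.copy()
    pruned[drop] = 0.0
    return pruned, drop
```

In a real pipeline, `exp_avg_sq` would come from the optimizer state (or be estimated from calibration batches), and the per-unit scores would feed into the cross-layer, cross-module sparsity allocation rather than a fixed per-matrix ratio as shown here.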
📝 Abstract
Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/NIRVANA.
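The KL-divergence-based calibration selection described above can be sketched roughly as follows. This is an illustrative guess at one plausible instantiation, not the paper's exact procedure: given the model's output distributions on a pool of candidate calibration samples, keep the samples whose distributions are closest in KL divergence to the pool average, on the intuition that representative samples yield more task-agnostic pruning decisions. The direction of the divergence and the choice of reference distribution are assumptions here.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def select_calibration(prob_dists, k):
    """Select the k candidate samples whose output distributions are
    closest (in KL divergence) to the pool-average distribution.
    (Hypothetical sketch; the paper's selection rule may differ.)

    prob_dists: (n_samples, vocab) array of softmax outputs."""
    mean_dist = prob_dists.mean(axis=0)
    mean_dist = mean_dist / mean_dist.sum()
    scores = np.array([kl_div(mean_dist, p) for p in prob_dists])
    return np.argsort(scores)[:k]  # most representative samples first
```

For example, with three candidates whose distributions are `[0.50, 0.50]`, `[0.55, 0.45]`, and `[0.99, 0.01]`, selecting `k=2` keeps the first two and filters out the outlier, which would otherwise skew the saliency estimates toward an unrepresentative input.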