🤖 AI Summary
To enable efficient deployment of large language models (LLMs) on memory-constrained devices, this paper proposes an importance-driven block pruning and low-rank weight-sharing framework. The method jointly optimizes compression and adaptation: (1) block-level importance scoring guides which blocks to prune; (2) each pruned block is replaced by reusing the weights of a retained counterpart from another layer, improving parameter efficiency; (3) block-specific low-rank adapters (LoRA-style) enable lightweight fine-tuning of these replacements. Output feature normalization and SVD-based adapter initialization are integrated to stabilize training and aid convergence. Experiments demonstrate state-of-the-art (SOTA) results: at 30% compression the method achieves SOTA on 5 of 6 benchmarks, and at 40% compression on all 6. The framework can also extend smaller models, improving performance on all six benchmarks using only ~0.3% additional training tokens and minimal extra parameters.
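The two supporting mechanisms above can be sketched in code. The sketch below is illustrative only: the cosine-similarity importance score and the symmetric square-root SVD split are plausible instantiations, not the paper's exact formulas. It shows (a) scoring a block by how much it changes its input representation, and (b) initializing a rank-r adapter pair from the truncated SVD of a target weight delta.

```python
import numpy as np

def block_importance(x_in, x_out):
    """Score a block by 1 - mean cosine similarity between its input
    and output activations (shape [tokens, dim]). A block whose output
    closely matches its input changes the representation little, making
    it a pruning candidate. (Hypothetical scoring choice.)"""
    num = np.sum(x_in * x_out, axis=-1)
    den = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1) + 1e-8
    return float(1.0 - np.mean(num / den))

def svd_lowrank_init(delta_w, rank):
    """Initialize low-rank adapter factors B (d_out x r) and A (r x d_in)
    from the best rank-r approximation of delta_w, splitting the singular
    values symmetrically between the two factors."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    b = u[:, :rank] * np.sqrt(s[:rank])
    a = np.sqrt(s[:rank])[:, None] * vt[:rank]
    return b, a

# Usage: score two hypothetical blocks, then build a rank-2 adapter.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
identity_like = block_importance(x, x)            # near 0: safe to prune
shuffled = block_importance(x, rng.permutation(x))  # larger: keep
delta = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 32))
B, A = svd_lowrank_init(delta, rank=2)  # B @ A reconstructs delta exactly
```

If `delta_w` truly has rank at most `r`, the factorization is exact; otherwise `B @ A` is the optimal rank-r approximation in Frobenius norm, which is what makes SVD a natural warm start for the adapters.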
📝 Abstract
The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We present a method that selectively prunes LLM blocks based on an importance score and replaces them with a low-parameter substitute. Specifically, we propose a principled strategy to replace each pruned block using a weight-sharing mechanism that leverages unpruned counterparts from the model, together with block-specific low-rank adapters. Furthermore, we facilitate the learning of these replacement blocks with output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Empirical evaluations demonstrate substantial gains over existing methods, achieving state-of-the-art performance on 5/6 benchmarks at a compression rate of 30% and on 6/6 benchmarks at a compression rate of 40%. We also demonstrate that our approach can extend smaller models, boosting performance on 6/6 benchmarks using only ~0.3% additional training tokens with minimal extra parameter cost.