FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
To enable efficient deployment of large language models (LLMs) on memory-constrained devices, this paper proposes an importance-driven modular pruning and low-rank weight-sharing framework. The method jointly optimizes compression and adaptation: (1) block-level importance scores guide fine-grained pruning; (2) each pruned block is replaced by reusing the weights of a retained block, enforcing weight sharing across layers for parameter efficiency; (3) block-specific low-rank adapters (LoRA-style) enable lightweight fine-tuning. Output feature normalization and SVD-based adapter initialization ensure training stability and convergence. Experiments demonstrate state-of-the-art (SOTA) performance: at 30% compression, the method achieves SOTA on 5 of 6 benchmarks; at 40% compression, it attains SOTA on all 6. Notably, only ~0.3% additional training tokens suffice to significantly improve performance across all six evaluation tasks.

📝 Abstract
The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We present a pruning method for LLMs that selectively removes model blocks based on an importance score and replaces them with a low-parameter alternative. Specifically, we propose a principled metric for replacing each pruned block via a weight-sharing mechanism that leverages unpruned counterparts from the model together with block-specific low-rank adapters. Furthermore, we facilitate the learning of these replacement blocks with output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Empirical evaluations demonstrate substantial gains over existing methods, achieving state-of-the-art performance on 5/6 benchmarks at a 30% compression rate and on 6/6 benchmarks at a 40% compression rate. We also demonstrate that our approach can extend smaller models, boosting performance on 6/6 benchmarks with only ~0.3% additional training tokens and minimal additional parameter cost.
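The replacement mechanism described above can be sketched as follows (a hypothetical illustration, not the paper's code: the names `replacement_block` and `rms_normalize`, the use of RMS as the output normalization, and all shapes are assumptions). A pruned block's forward pass reuses a shared, unpruned weight, adds a block-specific low-rank correction, and normalizes the output features:

```python
import numpy as np

def rms_normalize(x, eps=1e-6):
    # Output feature normalization: rescale each feature vector to unit RMS.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def replacement_block(x, shared_W, B, A):
    """Stand-in for a pruned block: apply a shared (unpruned) weight plus a
    block-specific low-rank correction B @ A, then normalize the output."""
    y = x @ (shared_W + B @ A).T
    return rms_normalize(y)

# Toy usage: a batch of 4 feature vectors of width 16, rank-2 adapter.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))
shared_W = rng.standard_normal((16, 16))  # weight borrowed from a kept block
B = rng.standard_normal((16, 2))          # block-specific adapter factors
A = rng.standard_normal((2, 16))
y = replacement_block(x, shared_W, B, A)
```

Because `shared_W` is reused across several replacement blocks, only the small factors `B` and `A` are unique per block, which is the source of the parameter savings.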
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Size Reduction
Performance Maintenance
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexiGPT
Efficient Model Optimization
Memory-constrained Devices