Adaptive Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization

📅 2024-05-17
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 7
Influential: 0
🤖 AI Summary
Large language models (LLMs) incur substantial computational overhead during deployment, necessitating efficient compression techniques. Method: This paper proposes an adaptive low-rank compression framework. First, it systematically characterizes the hierarchical low-rank structure inherent in LLM weights (e.g., LLaMA-2). Second, it introduces a feature distribution modeling mechanism based on pooled covariance matrices to accurately estimate layer-wise input feature statistics. Third, it employs Bayesian optimization to dynamically allocate optimal low-rank dimensions per layer, jointly optimizing compression ratio and task performance. Contribution/Results: Compared to existing structured pruning and low-rank methods, the proposed approach achieves significantly higher downstream task accuracy at equivalent compression ratios. Experimental results validate both the effectiveness and generalizability of hierarchical adaptive low-rank modeling for LLM compression.
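The feature-aware factorization step can be illustrated with a short sketch. This is not the paper's exact algorithm: the function name, the Cholesky-whitening formulation, and the regularization constant are assumptions chosen to mirror the idea of weighting the SVD truncation by a pooled input covariance rather than minimizing plain Frobenius error.

```python
import numpy as np

def feature_aware_low_rank(W, S, r, eps=1e-6):
    """Factor a weight matrix W (out_dim x in_dim) into two rank-r matrices,
    weighting the truncation by a pooled input covariance S (in_dim x in_dim).

    Illustrative sketch: whitening W with a Cholesky factor of S makes the
    SVD truncation minimize reconstruction error under the input feature
    distribution instead of the unweighted Frobenius norm.
    """
    L = np.linalg.cholesky(S + eps * np.eye(S.shape[0]))  # S ≈ L @ L.T
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :r] * s[:r]             # out_dim x r, columns scaled by singular values
    B = Vt[:r] @ np.linalg.inv(L)    # r x in_dim, un-whitened back to input space
    return A, B                      # W ≈ A @ B
```

Replacing W with the pair (A, B) stores r·(out_dim + in_dim) parameters instead of out_dim·in_dim, so the chosen rank r directly sets the per-layer compression ratio.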

📝 Abstract
In recent years, large language models (LLMs) have driven advances in natural language processing, but their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces the number of non-essential parameters by decomposing a weight matrix into the product of two low-rank matrices, yet its application to LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimension allocation. To address these challenges, we conduct empirical research on the low-rank characteristics of large models and propose a low-rank compression method suited to LLMs. The approach combines precise estimation of feature distributions through pooled covariance matrices with a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms strong existing structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
Problem

Research questions and friction points this paper is trying to address.

Adapt low-rank compression to the weight structure of large language models
Balance computational efficiency against task performance in LLMs
Allocate per-layer low-rank dimensions efficiently via Bayesian optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low-rank compression
Bayesian optimization
Pooled covariance matrices
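The rank-allocation search can likewise be sketched. The paper optimizes rank assignments across layers; for brevity, the toy below runs a minimal Gaussian-process Bayesian optimization over a single scalar "rank budget" in [0, 1] against a stand-in loss. All names here, the RBF kernel, and the expected-improvement acquisition are illustrative assumptions, not the paper's implementation.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def _rbf(a, b, ls=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def _expected_improvement(mu, sigma, best):
    """Expected improvement for minimization under a Gaussian posterior."""
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    Phi = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))      # normal CDF
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)  # normal PDF
    return (best - mu) * Phi + sigma * phi

def bayes_opt_rank_budget(loss, n_init=4, n_iter=12, seed=0):
    """Minimize `loss` over a rank-budget fraction x in [0, 1] via GP-based BO."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n_init)
    y = np.array([loss(x) for x in X])
    grid = np.linspace(0.0, 1.0, 201)
    for _ in range(n_iter):
        K = _rbf(X, X) + 1e-6 * np.eye(len(X))        # jitter for stability
        Ks = _rbf(grid, X)
        mu = Ks @ np.linalg.solve(K, y)               # posterior mean on grid
        v = np.linalg.solve(K, Ks.T)
        var = 1.0 - np.sum(Ks * v.T, axis=1)          # posterior variance
        ei = _expected_improvement(mu, np.sqrt(np.maximum(var, 0.0)), y.min())
        x_next = grid[np.argmax(ei)]                  # most promising budget
        X = np.append(X, x_next)
        y = np.append(y, loss(x_next))
    return X[np.argmin(y)], y.min()

# Hypothetical stand-in for "task loss after compressing at budget x";
# the real objective would evaluate the compressed model on downstream tasks.
best_x, best_y = bayes_opt_rank_budget(lambda x: (x - 0.6) ** 2)
```

In the paper's setting the search space is the vector of per-layer ranks rather than one scalar, but the surrogate-model-plus-acquisition loop above is the same mechanism that lets the allocation be tuned without exhaustively retraining or re-evaluating every configuration.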