🤖 AI Summary
Large language models (LLMs) incur substantial computational overhead during deployment, motivating efficient compression techniques. Method: This paper proposes an adaptive low-rank compression framework. First, it systematically characterizes the hierarchical low-rank structure inherent in LLM weights (e.g., LLaMA-2). Second, it introduces a feature-distribution modeling mechanism based on pooled covariance matrices to accurately estimate layer-wise input feature statistics. Third, it employs Bayesian optimization to allocate low-rank dimensions per layer, jointly optimizing compression ratio and task performance. Contribution/Results: Compared with existing structured pruning and low-rank methods, the proposed approach achieves significantly higher downstream task accuracy at equivalent compression ratios. Experimental results validate both the effectiveness and generalizability of hierarchical adaptive low-rank modeling for LLM compression.
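The two statistical components named above, pooled covariance estimation over calibration inputs and a factorization that respects those input statistics, can be sketched as follows. The paper's exact formulas are not given here, so the function names and the Cholesky-whitening variant below are illustrative assumptions in the spirit of data-aware SVD, not the authors' implementation:

```python
import numpy as np

def pooled_covariance(batches):
    """Pool per-batch covariance estimates of layer inputs.

    `batches` is a list of (n_i, d) calibration activation arrays; each
    batch's covariance is weighted by its share of the total sample count.
    """
    total = sum(len(X) for X in batches)
    return sum((len(X) / total) * (X.T @ X / len(X)) for X in batches)

def data_aware_factorize(W, S, r, eps=1e-6):
    """Rank-r factorization of W that accounts for input covariance S.

    Truncating the SVD of W @ L, where S = L @ L.T (Cholesky), minimizes
    the expected output error E||(W - A @ B) x||^2 over inputs x with
    covariance S, rather than the plain weight-reconstruction error.
    """
    L = np.linalg.cholesky(S + eps * np.eye(len(S)))  # regularize for stability
    U, sv, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :r] * sv[:r]               # (out, r), singular values absorbed
    B = Vt[:r, :] @ np.linalg.inv(L)    # (r, in), undo the whitening
    return A, B

# Toy usage: 4 calibration batches feeding a 64 -> 32 linear layer.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((100, 64)) for _ in range(4)]
S = pooled_covariance(batches)
W = rng.standard_normal((32, 64))
A, B = data_aware_factorize(W, S, r=16)
```

The layer then stores `A` and `B` in place of `W` and computes `A @ (B @ x)` at inference time.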
📝 Abstract
In recent years, large language models (LLMs) have driven advances in natural language processing, but their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, removes non-essential parameters by decomposing weight matrices into products of two low-rank matrices, yet its application to LLMs has not been extensively studied. The key to low-rank compression lies in two steps: low-rank factorization and low-rank dimension allocation. To address these challenges, we conduct an empirical study of the low-rank characteristics of large models and propose a low-rank compression method tailored to LLMs. The approach combines precise estimation of feature distributions through pooled covariance matrices with a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on LLaMA-2 models demonstrate that our method outperforms strong structured pruning and low-rank compression baselines in maintaining model performance at the same compression ratio.
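The core operation the abstract describes, replacing a weight matrix with the product of two low-rank matrices, can be sketched with a plain truncated SVD. This is a minimal illustration (the toy matrix and rank are arbitrary), not the paper's full data-aware procedure:

```python
import numpy as np

def low_rank_factorize(W, r):
    """Factor W (m x n) into A (m x r) @ B (r x n) via truncated SVD.

    By the Eckart-Young theorem, the rank-r SVD truncation is the best
    rank-r approximation of W in Frobenius norm.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]   # absorb singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
# A toy 512 x 512 "weight matrix" that is exactly rank 64 by construction.
W = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))
A, B = low_rank_factorize(W, r=64)

# Storage drops from m*n parameters to r*(m + n).
print((A.size + B.size) / W.size)        # → 0.25
print(np.allclose(W, A @ B, atol=1e-6))  # → True (W has rank <= 64)
```

For a real LLM layer the reconstruction is only approximate, which is why the rank `r` chosen per layer (the dimension-allocation problem the abstract raises) governs the accuracy/compression trade-off.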