Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition

📅 2025-06-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing joint optimization methods alternate between weight quantization and low-rank approximation without giving either component a defined role, so one tends to dominate and the resulting decomposition is suboptimal. The paper proposes Outlier-Driven Low-Rank Initialization (ODLRI), which assigns the low-rank component the specific role of capturing activation-sensitive outlier weights and leaves the quantized component to compress the remaining weights. Because the low-rank term absorbs the outliers, their negative impact on quantization is reduced. Incorporated into the joint optimization framework and evaluated on Llama2, Llama3, and Mistral models, ODLRI reduces activation-aware reconstruction error and quantization scale, and improves perplexity and zero-shot task accuracy at 2–3 bits.

📝 Abstract

Decomposing weight matrices into quantization and low-rank components ($\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component's unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers' negative impact on quantization, enabling a more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
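The alternating baseline described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation and it does not include ODLRI. The quantizer (per-tensor symmetric round-to-nearest), the truncated-SVD low-rank step, and the activation-aware error definition $\|(\mathbf{W} - \mathbf{Q} - \mathbf{L}\mathbf{R})\mathbf{X}\|_F$ are all assumptions chosen for illustration, as are the function names.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Per-tensor symmetric round-to-nearest quantizer (illustrative stand-in)."""
    levels = 2 ** (bits - 1) - 1              # e.g. 3 for 3-bit, 1 for 2-bit
    scale = np.abs(w).max() / levels          # a single large outlier inflates this
    if scale == 0:
        return w.copy(), 0.0
    q = np.clip(np.round(w / scale), -levels - 1, levels) * scale
    return q, scale

def low_rank(m, rank):
    """Best rank-`rank` approximation of m via truncated SVD, returned as (L, R)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

def alternating_decompose(w, bits=3, rank=8, iters=10):
    """Alternate between quantizing W - LR and fitting LR to W - Q."""
    l = np.zeros((w.shape[0], rank))
    r = np.zeros((rank, w.shape[1]))
    for _ in range(iters):
        q, scale = quantize_uniform(w - l @ r, bits)   # quantization step
        l, r = low_rank(w - q, rank)                   # low-rank step
    return q, l, r, scale

def activation_aware_error(w, q, l, r, x):
    """Assumed error definition: Frobenius norm of the residual applied to inputs x."""
    return np.linalg.norm((w - q - l @ r) @ x)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32))     # hypothetical weight matrix
    x = rng.standard_normal((32, 128))    # hypothetical calibration activations
    q, l, r, scale = alternating_decompose(w, bits=3, rank=4, iters=5)
    print("quantization scale:", scale)
    print("activation-aware error:", activation_aware_error(w, q, l, r, x))
```

In this sketch the low-rank factors start at zero, so the first quantization step sees the full weight matrix, outliers included. An outlier-driven initialization would instead seed `l` and `r` before the loop so that the quantizer never has to cover the outlier range. How ODLRI selects activation-sensitive weights is not specified in this summary, so that step is left out here.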

Problem

Research questions and friction points this paper is trying to address.

Optimizing weight decomposition for LLM compression

Balancing quantization and low-rank approximation effectively

Mitigating outliers' impact on quantization performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes weights into quantized and low-rank matrices with distinct roles

Outlier-Driven Low-Rank Initialization (ODLRI) technique

Balances quantization and low-rank approximation effectively

🔎 Similar Papers

Unified Framework for Neural Network Compression via Decomposition and Optimal Rank Selection