🤖 AI Summary
Existing joint optimization methods conflate the roles of weight quantization and low-rank decomposition, leading to functional coupling and mutual interference, thereby limiting compression efficacy. To address this, we propose a functional decoupling framework: low-rank components exclusively model activation-sensitive outlier weights, while quantized components handle standard weight compression. We introduce Outlier-Driven Low-Rank Initialization (ODLRI)—the first method to explicitly assign the semantic role of “capturing activation-sensitive outliers” to low-rank structure. Our end-to-end training paradigm integrates INT2/INT3 quantization, low-rank decomposition (LR), activation-aware error modeling, and joint optimization. Evaluated on Llama2/3 and Mistral, our approach significantly reduces activation-aware reconstruction error and quantization scale factors, achieving improved perplexity and zero-shot task accuracy at 2–3 bits.
📝 Abstract
Decomposing weight matrices into quantization and low-rank components ($mathbf{W} approx mathbf{Q} + mathbf{L}mathbf{R}$) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component's unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers' negative impact on quantization, enabling more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.