SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant accuracy degradation commonly observed in low-bit (e.g., W4A4) large language models under post-training quantization, primarily caused by activation outliers and weight sensitivity. To mitigate this, the authors propose SERQ, a three-stage method comprising static activation flattening, saliency-aware error reconstruction, and offline weight permutation. SERQ introduces the first saliency-aware mechanism to guide low-rank error reconstruction and employs a single low-rank compensation matrix to jointly correct quantization errors in both weights and activations. Notably, it avoids intermediate quantization or online auxiliary layers, substantially reducing calibration complexity. Experiments demonstrate that SERQ outperforms existing error-reconstruction approaches under both W4A8 and W4A4 settings, surpassing the current state-of-the-art rotation-based W4A4 method in accuracy while significantly lowering calibration overhead.

📝 Abstract
Post-training quantization (PTQ) has emerged as a prevailing technique for deploying large language models (LLMs) efficiently, in terms of both memory and computation, across edge devices and server platforms. Existing PTQ methods primarily aim to reduce the precision of weights and activations by mitigating quantization errors caused by channel-wise activation outliers, using techniques such as pre-quantization scaling, online transformations, or low-rank error reconstruction. Among these approaches, error reconstruction with low-rank adaptation (LoRA) has proven particularly effective, as it introduces a lightweight auxiliary computation path without requiring heavy optimization or additional online layers. However, prior studies reveal severe accuracy degradation under W4A4 settings, and conventional low-rank adaptations rely on two sequential factors, necessitating intermediate quantization during inference and thereby limiting low-precision efficiency. In this work, we propose SERQ, a saliency-aware error reconstruction method for low-bit LLM inference that employs a single low-rank compensation matrix. SERQ preserves efficient 4-bit matrix multiplication in linear layers by jointly mitigating quantization errors arising from both activation and weight saliency through three stages: (1) static activation flattening, (2) saliency-aware error reconstruction, and (3) offline weight permutation. The method incurs additional computation only for low-rank error reconstruction via a single decomposition, while all other operations are performed offline, keeping latency overhead minimal. Empirically, SERQ outperforms prior error reconstruction methods under both W4A8 and W4A4 settings and achieves higher accuracy than state-of-the-art rotation-based W4A4 approaches, while substantially reducing calibration complexity.
Problem

Research questions and friction points this paper is trying to address.

post-training quantization
low-bit inference
quantization error
large language models
low-rank adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

saliency-aware
low-rank error reconstruction
post-training quantization
LLM quantization
W4A4