🤖 AI Summary
This work addresses the significant accuracy degradation commonly observed in sub-4-bit weight-only quantization of large language models under resource-constrained settings, which stems primarily from quantization bottlenecks in down-projection matrices and activation distribution shifts. To mitigate these issues, the authors propose D²Quant, a framework that improves quantization from both the weight and activation perspectives. Specifically, they introduce a Dual-Scale Quantizer (DSQ) whose scaling factor can be absorbed into adjacent computation, alleviating quantization error in down-projection matrices without extra bit-width, and a Deviation-Aware Correction (DAC) mechanism that applies a mean-shift correction within LayerNorm to compensate for quantization-induced activation shifts. Notably, D²Quant incurs no additional bit overhead or retraining requirements, yet achieves state-of-the-art performance in sub-4-bit weight-only quantization across multiple mainstream large language models, substantially outperforming existing post-training quantization (PTQ) methods.
📝 Abstract
Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision, and our analysis identifies two main causes: (1) down-projection matrices are a well-known quantization bottleneck, but maintaining their fidelity often requires extra bit-width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual-Scale Quantizer (DSQ) tailored to down-projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation-Aware Correction (DAC), which incorporates a mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D$^2$Quant delivers superior performance for weight-only PTQ at sub-4-bit precision. The code and models will be available at https://github.com/XIANGLONGYAN/D2Quant.
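The abstract names the two mechanisms but does not spell out their algorithms. The sketch below is a hypothetical illustration of how such components could look: `dual_scale_quantize` factors a per-input-channel scale `r` out of a down-projection matrix before standard per-row quantization (so `r` could be absorbed into the preceding layer's output rather than stored at extra bit cost), and `layernorm_with_mean_correction` folds a precomputed shift `delta` into the LayerNorm bias. The function names, the choice of max-based scales, and the exact placement of `delta` are assumptions, not the paper's actual DSQ/DAC formulations.

```python
import numpy as np

def dual_scale_quantize(W, bits=3):
    """Illustrative dual-scale weight quantizer (hypothetical form of DSQ).

    W has shape (out_features, in_features), e.g. a down-projection matrix.
    An outer per-input-channel scale r is factored out first; since
    W @ x == (W / r) @ (r * x), r can be absorbed into the preceding
    activation path and need not be stored with the quantized weights.
    """
    # Outer scale: per-input-channel magnitude, factored out of W.
    r = np.abs(W).max(axis=0, keepdims=True)          # shape (1, in)
    r[r == 0] = 1.0
    W_bal = W / r

    # Inner scale: plain per-row symmetric min-max quantization.
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W_bal).max(axis=1, keepdims=True) / qmax  # shape (out, 1)
    s[s == 0] = 1.0
    Q = np.clip(np.round(W_bal / s), -qmax - 1, qmax)

    return Q, s, r  # dequantize as (Q * s) * r

def layernorm_with_mean_correction(x, gamma, beta, delta, eps=1e-5):
    """LayerNorm whose bias is shifted by a precomputed correction delta,
    a hypothetical stand-in for DAC's quantization-induced mean-shift fix."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + (beta - delta)
```

Because `r` multiplies activations that are already computed in floating point, absorbing it adds no inference-time overhead, which matches the paper's claim of improving accuracy without increasing the bit budget.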