Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the significant accuracy degradation in existing low-bit post-training quantization methods, which often overlook the intrinsic low-rank structure of model weights. To mitigate this, we propose a Structured Residual Reconstruction (SRR) framework that preserves the top-k singular subspace of activation-weighted weights prior to quantization, quantizes only the residual component, and leverages the remaining rank budget to reconstruct quantization error. Our approach introduces a theoretically grounded rank allocation strategy to determine the optimal k and naturally enables efficient Quantized Parameter-Efficient Fine-Tuning (QPEFT), enhancing training stability. By integrating singular value decomposition with activation-aware and gradient-aware scaling, SRR substantially reduces perplexity across diverse large language models and quantization settings, achieving an average 5.9-point improvement on the GLUE benchmark with 2-bit QPEFT.

📝 Abstract
Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$, using a rank-$r$ correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when $\mathbf{W}$ has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-$k$ singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank $r-k$ for error reconstruction. We derive a theory-guided criterion for selecting $k$ by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that the resulting $\mathbf{Q} + \mathbf{L}\mathbf{R}$ parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT.
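The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is our own reading of the method, not the paper's implementation: the symmetric uniform quantizer, the diagonal activation scaling, and all function names here are simplifying assumptions.

```python
import numpy as np

def uniform_quantize(w, bits=2):
    """Symmetric per-tensor uniform quantizer (a simple stand-in for
    whatever quantizer the paper actually uses)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.max(np.abs(w))
    scale = amax / qmax if amax > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def srr_decompose(W, s, r, k, bits=2):
    """Structured Residual Reconstruction sketch: W ~ Q + L @ R.

    W : (d_out, d_in) weight matrix
    s : (d_in,) positive activation scales (assumed diagonal scaling)
    r : total rank budget; k : rank preserved before quantization
    """
    S, S_inv = np.diag(s), np.diag(1.0 / s)
    # 1. Preserve the top-k singular subspace of the activation-scaled weight.
    U, sv, Vt = np.linalg.svd(W @ S, full_matrices=False)
    W_top = (U[:, :k] * sv[:k]) @ Vt[:k] @ S_inv  # rank-k preserved part
    # 2. Quantize only the residual, not the dominant directions.
    W_res = W - W_top
    Q = uniform_quantize(W_res, bits=bits)
    # 3. Spend the remaining rank r-k reconstructing the quantization
    #    error, again measured in the activation-weighted norm.
    E = (W_res - Q) @ S
    Ue, se, Vte = np.linalg.svd(E, full_matrices=False)
    # Stack the preserved subspace and the error-reconstruction factors
    # into one rank-r correction L @ R.
    L = np.hstack([U[:, :k] * sv[:k], Ue[:, :r - k] * se[:r - k]])
    R = np.vstack([Vt[:k] @ S_inv, Vte[:r - k] @ S_inv])
    return Q, L, R
```

On a weight matrix with strong low-rank structure, this allocation should reduce the activation-weighted reconstruction error well below plain uniform quantization, since the dominant directions never pass through the quantizer; the theory-guided choice of $k$ from the paper is not reproduced here.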
Problem

Research questions and friction points this paper is trying to address.

Quantization Error Reconstruction
Post-Training Quantization
Low-Rank Structure
Rank Budget
LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured Residual Reconstruction
Quantization Error Reconstruction
Rank Budget Allocation
Post-Training Quantization
Quantized Parameter-Efficient Fine-Tuning