MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation in low-bit post-training quantization caused by bias that the Hessian-approximation assumption introduces into existing residual reconstruction methods. To mitigate this bias, the authors propose a module-level adaptive residual reconstruction mechanism that assigns each module its own scaling coefficient, balancing the bias against cross-layer error correction. To avoid the computational cost of a per-module coefficient search, they devise an efficient coefficient update strategy based on PID feedback control. Experiments at 4-bit and lower precision report performance gains of up to 20.2% on large language models and relative gains of up to 4.6% on vision Transformers over state-of-the-art residual reconstruction methods.

📝 Abstract

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce the error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, so a single global residual strength cannot balance effective correction against residual-related bias across all modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid an expensive per-module coefficient search and to obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (4-bit and below), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over state-of-the-art residual reconstruction methods. Code will be made publicly available upon acceptance.
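
The abstract does not spell out the PID update rule, so the following is only an illustrative sketch of how a PID loop could refine one module's scaling coefficient from reconstruction-error feedback. Everything in it is an assumption made for illustration: the names `PIDController`, `toy_reconstruction_error`, and `refine_coefficient`, the gains, the quadratic error curve with its minimum at 0.6, and the choice of the negative finite-difference slope of the error as the feedback signal. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PIDController:
    """Generic discrete PID controller (gains are illustrative, not from the paper)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_signal: Optional[float] = None

    def step(self, signal: float) -> float:
        # Accumulate the integral term and form a one-step difference
        # for the derivative term (zero on the very first call).
        self.integral += signal
        derivative = 0.0 if self.prev_signal is None else signal - self.prev_signal
        self.prev_signal = signal
        return self.kp * signal + self.ki * self.integral + self.kd * derivative


def toy_reconstruction_error(alpha: float) -> float:
    # Hypothetical per-module error curve: too little residual (small alpha)
    # under-corrects accumulated error, too much adds bias. Minimum at 0.6.
    return (alpha - 0.6) ** 2 + 0.05


def refine_coefficient(
    error_fn: Callable[[float], float],
    alpha: float = 1.0,
    steps: int = 200,
    h: float = 1e-3,
) -> float:
    """Refine one module's coefficient with PID feedback on the error slope."""
    pid = PIDController(kp=0.3, ki=0.02, kd=0.05)
    for _ in range(steps):
        # Assumed feedback signal: negative central-difference slope of the
        # reconstruction error, so a positive signal means "increase alpha".
        slope = (error_fn(alpha + h) - error_fn(alpha - h)) / (2 * h)
        alpha += pid.step(-slope)
    return alpha


if __name__ == "__main__":
    alpha = refine_coefficient(toy_reconstruction_error)
    print(f"refined alpha = {alpha:.4f}")
```

In a per-module setting, one such loop would run for each module with its own controller state, which is what would allow the coefficients to differ across modules without an exhaustive search over candidate values.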

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

residual reconstruction

Hessian approximation bias

low-bit quantization

module-adaptive

Innovation

Methods, ideas, or system contributions that make the work stand out.

post-training quantization

residual reconstruction

module-adaptive scaling