HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the discrepancy in post-training quantization where minimizing quantization error alone often yields low reconstruction error but high task loss, a phenomenon rooted in the high sensitivity of the loss landscape to perturbations along a few high-curvature directions dictated by the Hessian matrix. To mitigate this, the authors propose HeRo-Q, a method that introduces a lightweight, learnable rotation-and-compression matrix applied prior to quantization. This matrix is jointly optimized to reshape the loss landscape by effectively reducing the largest Hessian eigenvalues, thereby enhancing robustness to quantization noise. Notably, HeRo-Q is the first approach to combine Hessian conditioning with learnable orthogonal transformations without altering the model architecture, significantly improving stability under extreme low-bit settings such as W3A16. Experiments on Llama and Qwen models demonstrate state-of-the-art performance, with Llama3-8B achieving 70.15% accuracy on GSM8K at W3A16, effectively alleviating the performance collapse typically induced by aggressive quantization.
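The paper's learned rotation-compression matrix is not reproduced here, but the underlying rotate-then-quantize idea it builds on (also used by SpinQuant-style methods) can be sketched. The snippet below is a minimal illustration under stated assumptions: a fixed random orthogonal matrix stands in for the learned transform, and `quantize` and `random_orthogonal` are hypothetical helpers, not the authors' implementation. Because the rotation is orthogonal, it can be undone exactly in full precision, so quantizing in the rotated basis changes the noise distribution without changing the function.

```python
import numpy as np

def quantize(w, bits=3):
    # Uniform symmetric quantizer: snap values to 2^(bits-1)-1 signed levels.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def random_orthogonal(n, seed=0):
    # QR factorization of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((n, n)))
    return q

n = 8
# Toy weight matrix whose energy is concentrated in a few directions,
# mimicking the sensitive high-curvature directions the paper describes.
W = np.diag(np.linspace(0.1, 4.0, n))
R = random_orthogonal(n)

# Rotate, quantize in the rotated basis, rotate back.
# R.T @ R = I, so the round trip is exact apart from quantization noise.
W_hat = R.T @ quantize(R @ W)

rel_err = np.linalg.norm(W_hat - W) / np.linalg.norm(W)
print(f"relative reconstruction error at 3 bits: {rel_err:.3f}")
```

In HeRo-Q the rotation is additionally *optimized* so that the loss landscape's largest Hessian eigenvalues shrink, which a fixed random rotation does not attempt; this sketch only shows the mechanical transform-quantize-invert pipeline into which such a learned matrix would slot.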

📝 Abstract
Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian-Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3-8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Hessian matrix
Low-bit quantization
Quantization robustness
LLM compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hessian conditioning
Post-Training Quantization
Low-bit quantization
Loss landscape
Robust quantization
Jinhao Zhang
Harbin Institute of Technology, Shenzhen
Autonomous Driving · Embodied AI · Generative Model
Zicheng Yan
University of Science and Technology of China, Hefei, China
Boyang Zhang
Peng Cheng Laboratory, Shenzhen, China; University of Chinese Academy of Sciences, Beijing, China
Jun Sun
Zhejiang University, Hangzhou, China
Daning Cheng
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China