Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization

📅 2025-05-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address accuracy degradation in post-training quantization, where quantization errors in weights, activations, and preceding layers couple and accumulate, this paper proposes Qronos, an iterative algorithm that sequentially rounds and updates neural network weights while jointly correcting these error sources. Its contributions are threefold: (i) explicit correction of errors from both weight and activation quantization, as well as from previously quantized layers, within a disciplined optimization framework that subsumes existing data-driven rounding approaches; (ii) an efficient implementation that alternates between error correction and error diffusion, solving the underlying least-squares problems via Cholesky decomposition; and (iii) compatibility with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization. Evaluated on the Llama3 family, Qronos achieves state-of-the-art results when quantizing weights, activations, and/or KV caches (e.g., W4A4KV4), consistently outperforming prior adaptive rounding methods.
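To make the "sequentially round and update" idea concrete, here is a minimal NumPy sketch of the general GPTQ-style scheme the summary describes: quantize one weight column at a time, then diffuse the resulting least-squares error into the not-yet-quantized columns using a Cholesky factor of the inverse Gram matrix. This is an illustrative sketch under stated assumptions, not the paper's actual Qronos algorithm; all names, the int4-style grid, and the damping constant are hypothetical.

```python
import numpy as np

def round_to_grid(W, scale):
    # Uniform symmetric rounding to an int4-style grid (illustrative choice).
    return scale * np.clip(np.round(W / scale), -8, 7)

def sequential_round(W, X, damp=0.01):
    """GPTQ-style sequential rounding with least-squares error diffusion.

    W: (out_features, d) weight matrix
    X: (d, n) calibration activations
    """
    d = W.shape[1]
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d)  # damping keeps H positive definite
    Hinv = np.linalg.inv(H)
    U = np.linalg.cholesky(Hinv).T               # upper-triangular factor: Hinv = U.T @ U

    Q = W.astype(float).copy()
    scale = np.abs(W).max() / 7                  # single illustrative scale for the whole matrix
    for i in range(d):
        q = round_to_grid(Q[:, i], scale)        # quantize column i
        err = (Q[:, i] - q) / U[i, i]
        Q[:, i] = q
        if i + 1 < d:
            # Diffuse the rounding error into the remaining (unquantized) columns.
            Q[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q
```

The per-column update is what makes the method data-driven: later columns absorb earlier rounding errors as measured on the calibration activations, rather than being rounded independently.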

📝 Abstract
We introduce Qronos -- a new state-of-the-art post-training quantization algorithm that sequentially rounds and updates neural network weights. Qronos not only explicitly corrects errors due to both weight and activation quantization, but also errors resulting from quantizing previous layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that subsumes and surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an efficient implementation that uses the Cholesky decomposition for solving least-squares problems. We also demonstrate that Qronos is compatible with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent autoregressive language generation models in the Llama3 family; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches.
Problem

Research questions and friction points this paper is trying to address.

Quantization errors in weights and activations degrade model accuracy
Errors from quantizing previous layers accumulate across the network
Existing data-driven post-training quantization methods do not fully correct these coupled errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequentially rounds and updates neural network weights
Alternates between error correction and error diffusion via optimal update rules
Uses Cholesky decomposition for efficient least-squares solving
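The Cholesky point above reduces to a standard fact: the least-squares subproblems have symmetric positive-definite (damped) Gram matrices, so their normal equations can be solved efficiently and stably with a Cholesky factorization. A generic sketch (variable names and sizes are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Solve min_w ||X.T w - y||^2 via the normal equations H w = b,
# where H = X X.T is made positive definite by a small damping term.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 256))   # d x n calibration activations
y = rng.standard_normal(256)         # target outputs
H = X @ X.T + 1e-3 * np.eye(16)      # damped Gram matrix (SPD)
b = X @ y
c, low = cho_factor(H)               # factor once...
w = cho_solve((c, low), b)           # ...then solve cheaply per right-hand side
```

Factoring once and reusing the factor across many right-hand sides is what makes this attractive when correcting many weight rows against the same calibration data.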