🤖 AI Summary
Post-training quantization (PTQ) of large language models (LLMs) typically minimizes local activation errors, which can distort the global output distribution. Method: This paper proposes a PTQ method that directly optimizes the full-model KL divergence between the quantized and original model outputs. Its core innovations are the first incorporation of a Kronecker-factored approximation of the layer-wise Hessian matrix into the adaptive rounding process—providing theoretical grounding for rounding decisions without reliance on specific quantizer designs—and a quantizer-agnostic, model-level KL divergence optimization framework. Results: Evaluated on models ranging from 10B to 100B parameters, the method reduces average KL divergence by approximately 30% and achieves state-of-the-art performance on downstream tasks.
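To make the objective concrete, here is a minimal NumPy sketch of the full-model quantity the summary describes: the KL divergence between the original and quantized models' output token distributions, averaged over positions. This is purely illustrative of the objective, not the paper's optimization procedure; the logits here are synthetic stand-ins.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def model_kl(orig_logits, quant_logits):
    # Mean KL(P_orig || P_quant) over token positions: the model-level
    # objective, as opposed to per-layer activation error.
    p = softmax(orig_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

# Toy logits for 4 token positions over a 10-word vocabulary.
rng = np.random.default_rng(0)
orig = rng.normal(size=(4, 10))
quant = orig + 0.1 * rng.normal(size=(4, 10))  # stand-in for quantization noise

print(model_kl(orig, orig))   # identical models: divergence is 0
print(model_kl(orig, quant))  # perturbed model: small positive divergence
```

A key point the summary makes is that two quantized models with equal per-layer activation error can score very differently on this model-level metric.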
📝 Abstract
The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the *full model* KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by $\approx 30\%$ while achieving state-of-the-art performance on downstream tasks.
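The Kronecker-factored Hessian sketch can be illustrated with a small NumPy example. Assuming a K-FAC-style factorization $H \approx A \otimes B$ (with $A$ built from input second moments and $B$ from output-gradient second moments; this is an assumed form for illustration, not YAQA's exact estimator), the quadratic loss increase caused by a weight perturbation $\Delta$ never requires materializing the full $mn \times mn$ Hessian, thanks to the identity $\mathrm{vec}(\Delta)^\top (A \otimes B)\, \mathrm{vec}(\Delta) = \mathrm{tr}(\Delta^\top B \Delta A)$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4  # weight matrix W is m x n

# K-FAC-style Kronecker factors (assumed form, for illustration only):
# A from input second moments, B from output-gradient second moments.
X = rng.normal(size=(n, 16))   # 16 calibration inputs
G = rng.normal(size=(m, 16))   # 16 output gradients
A = X @ X.T / 16               # n x n, symmetric
B = G @ G.T / 16               # m x m, symmetric

# Quantization perturbation of the weights.
Delta = rng.normal(size=(m, n))

# Quadratic loss increase under H ≈ A ⊗ B, two equivalent ways:
# 1) explicit Kronecker product on vec(Delta) (column-major stacking),
# 2) the memory-cheap identity vec(D)^T (A⊗B) vec(D) = tr(D^T B D A).
v = Delta.flatten(order="F")           # vec() stacks columns
dense = v @ np.kron(A, B) @ v          # O((mn)^2) memory: intractable at scale
cheap = np.trace(Delta.T @ B @ Delta @ A)  # only m x m and n x n factors needed

print(np.allclose(dense, cheap))  # True
```

This is why Kronecker factorization makes a full layerwise Hessian tractable at hundred-billion-parameter scale: only the two small factors are stored, never the dense Hessian.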