Theory-optimal Quantization Based on Flatness

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the significant performance degradation of large language models under low-bit post-training quantization, primarily caused by activation outliers. To mitigate this issue, the authors introduce a novel Flatness metric to characterize outlier distributions and derive its theoretical optimum. Building upon this insight, they propose a Bidirectional Diagonal Quantization (BDQ) framework that employs learnable diagonal transformations to uniformly diffuse outlier energy across all dimensions, thereby overcoming limitations inherent in existing linear transformation approaches. Experimental results demonstrate that BDQ achieves less than 1% accuracy loss when applying W4A4 quantization to LLaMA-3-8B. Furthermore, under a W2A4KV16 setting on DeepSeek-R1-Distill-LLaMA-70B, BDQ reduces the performance gap relative to the current state-of-the-art method by 39.1%.

📝 Abstract

Post-training quantization has emerged as a widely adopted technique for compressing Large Language Models (LLMs) and accelerating their inference. The primary challenge in LLM quantization stems from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretically optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that disperses outlier patterns through optimized matrix transformations, distributing outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, BDQ reduces the performance gap by 39.1% relative to state-of-the-art approaches on the DeepSeek-R1-Distill-LLaMA-70B model.
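To make the link between outliers and quantization error concrete, the following is a minimal sketch and not the paper's BDQ method or its Flatness metric, neither of which is defined on this page. It assumes plain symmetric round-to-nearest 4-bit quantization with one scale per vector. The helper names (`quantize_symmetric`, `mse`, `dot`), the synthetic data, and the hand-picked diagonal `diag` are all illustrative assumptions.

```python
import random

def quantize_symmetric(values, bits=4):
    """Symmetric round-to-nearest quantization with a single scale per vector."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax                     # the largest entry sets the step size
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
n = 256

# A "flat" activation vector: all entries have similar magnitude.
flat = [random.uniform(-1.0, 1.0) for _ in range(n)]

# The same vector with one outlier channel (assumed magnitude, for illustration).
spiky = list(flat)
spiky[0] = 50.0

# Error on the non-outlier entries: the outlier stretches the quantization
# step so far that the ordinary entries all round to zero.
err_flat = mse(flat[1:], quantize_symmetric(flat)[1:])
err_spiky = mse(spiky[1:], quantize_symmetric(spiky)[1:])

# A diagonal rescaling leaves the dot product unchanged:
# (x / d) . (d * w) == x . w. Here d is hand-picked, not learned.
weights = [random.uniform(-1.0, 1.0) for _ in range(n)]
diag = [1.0] * n
diag[0] = 50.0
scaled_x = [x / d for x, d in zip(spiky, diag)]
scaled_w = [w * d for w, d in zip(weights, diag)]

exact = dot(spiky, weights)
rescaled = dot(scaled_x, scaled_w)

print(f"MSE without outlier: {err_flat:.5f}")
print(f"MSE with outlier:    {err_spiky:.5f}")
print(f"dot product preserved: {abs(exact - rescaled) < 1e-9}")
```

In this toy setup the diagonal rescaling removes the outlier from the activations but moves it into the weights, which is consistent with the abstract's observation that outlier patterns can persist after a transformation. How BDQ chooses its diagonal operations to avoid this is not described here.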

Problem

Research questions and friction points this paper is trying to address.

post-training quantization

Large Language Models

activation outliers

quantization error

low-bit precision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flatness

post-training quantization

outlier mitigation