Theory-optimal Quantization Based on Flatness

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the significant performance degradation of large language models under low-bit post-training quantization, primarily caused by activation outliers. To mitigate this issue, the authors introduce a novel Flatness metric to characterize outlier distributions and derive its theoretical optimum. Building upon this insight, they propose a Bidirectional Diagonal Quantization (BDQ) framework that employs learnable diagonal transformations to uniformly diffuse outlier energy across all dimensions, thereby overcoming limitations inherent in existing linear transformation approaches. Experimental results demonstrate that BDQ achieves less than 1% accuracy loss when applying W4A4 quantization to LLaMA-3-8B. Furthermore, under a W2A4KV16 setting on DeepSeek-R1-Distill-LLaMA-70B, BDQ reduces the performance gap relative to the current state-of-the-art method by 39.1%.
📝 Abstract
Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretically optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that disperses outlier patterns through optimized matrix transformations. BDQ distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.
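To make the abstract's core idea concrete, the sketch below shows why a single outlier channel hurts low-bit activation quantization and how a diagonal rescaling can reduce the error while leaving the full-precision product unchanged. It is an illustration under stated assumptions, not the BDQ method: the diagonal here is fixed from per-channel maxima rather than learned, only activations are quantized (4-bit, per-tensor), and `peak_to_rms` is a hypothetical stand-in for outlier concentration, since the listing does not give the definition of Flatness. All names (`fake_quantize`, `peak_to_rms`, `d`) are invented for this example.

```python
import numpy as np

def fake_quantize(x, bits=4):
    """Symmetric per-tensor uniform quantization, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4 bits
    scale = np.abs(x).max() / qmax        # one step size for the whole tensor
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def peak_to_rms(x):
    """Illustrative outlier proxy (assumption): NOT the paper's Flatness metric."""
    return np.abs(x).max() / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))             # activations: tokens x channels
X[:, 3] *= 20.0                           # inject one outlier channel
W = rng.normal(size=(32, 16))             # weights: channels x outputs

# Fixed diagonal D = diag(d), chosen so every activation channel peaks at 1.
# (Assumption for illustration; BDQ is described as learning its diagonals.)
d = np.abs(X).max(axis=0)
X_t = X / d                               # X D^{-1}
W_t = W * d[:, None]                      # D W, so X_t @ W_t equals X @ W

ref = X @ W
err_plain = np.linalg.norm(fake_quantize(X) @ W - ref)
err_diag = np.linalg.norm(fake_quantize(X_t) @ W_t - ref)

print(f"peak/RMS before: {peak_to_rms(X):.2f}, after: {peak_to_rms(X_t):.2f}")
print(f"A4 output error before: {err_plain:.2f}, after: {err_diag:.2f}")
```

In this toy setup the rescaling moves the outlier magnitude from the activations into the weights, so a full W4A4 pipeline would also have to cope with the transformed weights; that trade-off is presumably what the learned transformations in the paper are meant to balance.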
Problem

Research questions and friction points this paper is trying to address.

post-training quantization
Large Language Models
activation outliers
quantization error
low-bit precision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flatness
post-training quantization
outlier mitigation
Bidirectional Diagonal Quantization
theoretical optimal quantization
Xiusheng Huang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Zhe Li
AMD
Xuanwu Yin
AMD
Lu Wang
Ritzz-AI
Yequan Wang
Beijing Academy of Artificial Intelligence
Dong Li
Xilinx Beijing
Computer Vision, Machine Learning
Emad Barsoum
AMD, Columbia University
Generative AI, Foundation Models, Agentic AI, Computer Vision, ML Frameworks
Kang Liu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences