InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

📅 2026-05-25

🤖 AI Summary

This work addresses the accuracy degradation in low-bit activation quantization caused by the mismatch between activation distributions and uniform quantizers, a key bottleneck for efficient large language model (LLM) deployment. From an information-theoretic perspective, the authors propose InfoQuant, a training-free activation distribution shaping method that formulates activation transformation as a quantizer-aware distribution design problem. The approach combines a Peak Suppression Orthogonal Transformation (PSOT) with an adaptive outlier-token selection mechanism so that activations have both a compact numerical range and sufficient dispersion within that range. Under the W4A4KV4 setting, the method preserves 97% of full-precision accuracy on average and, on LLaMA-2 13B, reduces the performance gap by 42% relative to the previous best post-training quantization (PTQ) approach. It consistently outperforms both existing training-free and end-to-end trained baselines.

📝 Abstract

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a training-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant).
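
The two properties named in the abstract, a smaller numerical range and sufficient dispersion within it, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration and is not taken from the InfoQuant code: the activations are synthetic, the quantizer is a plain symmetric per-tensor 4-bit uniform quantizer, and a fixed normalized Hadamard rotation stands in for an orthogonal transform. It is not PSOT, which the abstract describes as an optimized transform. The helper names `uniform_quantize`, `level_entropy`, and `hadamard` are hypothetical.

```python
import numpy as np

def uniform_quantize(x, bits=4):
    """Symmetric per-tensor uniform quantizer: returns dequantized values and integer levels."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits
    scale = np.max(np.abs(x)) / qmax                # step size set by the widest value
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int64)
    return q * scale, q

def level_entropy(q):
    """Shannon entropy (in bits) of the empirical distribution over occupied levels."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 256
x = rng.normal(0.0, 1.0, n)   # synthetic "activation" vector (assumption)
x[3] = 60.0                   # one outlier channel stretches the range

H = hadamard(n)               # stand-in orthogonal transform, NOT PSOT
y = H @ x                     # rotated activations

x_hat, qx = uniform_quantize(x)
y_hat, qy = uniform_quantize(y)
x_rec = H.T @ y_hat           # rotate back after quantization

err_plain = float(np.mean((x - x_hat) ** 2))
err_rot = float(np.mean((x - x_rec) ** 2))
ent_plain = level_entropy(qx)
ent_rot = level_entropy(qy)

print(f"range   plain={x.max() - x.min():.2f}  rotated={y.max() - y.min():.2f}")
print(f"MSE     plain={err_plain:.4f}  rotated={err_rot:.4f}")
print(f"entropy plain={ent_plain:.3f}  rotated={ent_rot:.3f} bits")
```

In this toy setting the single outlier sets the step size, so almost every other value collapses onto the zero level: the range is wide and the level-occupancy entropy is close to zero. After the rotation the outlier's energy is spread across all coordinates, the range shrinks, and the values occupy many more of the 16 levels, which lowers the reconstruction error. This only shows the general effect the abstract argues for; it says nothing about how PSOT or the adaptive outlier-token selection are actually computed.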

Problem

Research questions and friction points this paper is trying to address.

low-bit quantization

activation distribution

large language models

quantization error

post-training quantization

Innovation

Methods, ideas, or system contributions that make the work stand out.

activation quantization

distribution shaping

information-theoretic analysis