🤖 AI Summary
LLMs suffer severe accuracy degradation under low-bit quantization due to outlier activations. This paper proposes KurTail, a kurtosis-driven post-training quantization scheme that introduces inter-layer rotation to reshape activation distributions—specifically mitigating heavy-tailed outliers—enabling fully symmetric 4-bit quantization of weights, activations, and KV caches. Crucially, it formulates kurtosis minimization as the objective for learning layer-wise orthogonal rotation matrices, suppressing outliers while preserving representational fidelity. Rotation learning for Llama3-70B requires only a single H100 GPU, drastically lowering hardware requirements. Experiments demonstrate substantial improvements: +13.3% MMLU accuracy and −15.5% Wiki perplexity over QuaRot, and +2.6% MMLU and −2.9% perplexity over SpinQuant, with significantly reduced training overhead.
📝 Abstract
One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We use layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and a 2.9% perplexity reduction, all while lowering the training cost. For comparison, learning the rotation with SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
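To see why rotating activations helps, here is a minimal NumPy sketch of the underlying intuition (not the paper's method): a vector with a few outlier channels has very high excess kurtosis, and multiplying it by an orthogonal matrix spreads the outlier energy across channels, flattening the tails. KurTail *learns* the rotation by minimizing kurtosis; this sketch just uses a random orthogonal matrix to illustrate the effect.

```python
import numpy as np

def excess_kurtosis(x):
    # E[((x - mu)/sigma)^4] - 3: zero for a Gaussian,
    # large and positive for heavy-tailed (outlier-prone) data.
    x = np.asarray(x, dtype=np.float64)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)

# Toy "activation" vector: mostly Gaussian, with a few large
# outlier channels of the kind that hurt 4-bit quantization.
d = 512
act = rng.normal(size=d)
act[:4] += 40.0  # inject outliers

# Random orthogonal rotation via QR decomposition. This is a
# stand-in: KurTail optimizes the rotation to minimize kurtosis
# rather than sampling it at random.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
rotated = Q @ act

print(f"excess kurtosis before rotation: {excess_kurtosis(act):.1f}")
print(f"excess kurtosis after rotation:  {excess_kurtosis(rotated):.1f}")
```

Because the rotation is orthogonal, it is lossless (it can be folded into adjacent weight matrices and inverted exactly), so the quantizer sees a better-conditioned distribution without changing the network's function.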