🤖 AI Summary
LLMs suffer severe accuracy degradation under low-bit quantization due to outlier activations. This paper proposes KurTail, a kurtosis-driven post-training quantization scheme that introduces inter-layer rotation to reshape activation distributions—specifically mitigating heavy-tailed outliers—enabling fully symmetric 4-bit quantization of weights, activations, and KV caches. Crucially, it formulates kurtosis minimization as the objective for learning layer-wise orthogonal rotation matrices, suppressing outliers while preserving representational fidelity. Rotation learning for Llama3-70B requires only a single H100 GPU, drastically lowering hardware requirements. Experiments demonstrate substantial improvements: +13.3% MMLU accuracy and −15.5% Wiki perplexity over QuaRot, and +2.6% MMLU and −2.9% perplexity over SpinQuant, with significantly reduced training overhead.
📝 Abstract
One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We use layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and a 2.9% perplexity reduction, all while lowering the training cost. For comparison, learning the rotation with SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
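To see why rotating activations helps, here is a minimal NumPy sketch of the underlying intuition (not the paper's method): a vector with a few outlier channels has very high excess kurtosis, and multiplying it by an orthogonal matrix spreads the outlier energy across channels, flattening the tails. KurTail *learns* the rotation by minimizing kurtosis; this sketch just uses a random orthogonal matrix to illustrate the effect.

```python
import numpy as np

def excess_kurtosis(x):
    # E[((x - mu)/sigma)^4] - 3: zero for a Gaussian,
    # large and positive for heavy-tailed (outlier-prone) data.
    x = np.asarray(x, dtype=np.float64)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)

# Toy "activation" vector: mostly Gaussian, with a few large
# outlier channels of the kind that hurt 4-bit quantization.
d = 512
act = rng.normal(size=d)
act[:4] += 40.0  # inject outliers

# Random orthogonal rotation via QR decomposition. This is a
# stand-in: KurTail optimizes the rotation to minimize kurtosis
# rather than sampling it at random.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
rotated = Q @ act

print(f"excess kurtosis before rotation: {excess_kurtosis(act):.1f}")
print(f"excess kurtosis after rotation:  {excess_kurtosis(rotated):.1f}")
```

Because the rotation is orthogonal, it is lossless (it can be folded into adjacent weight matrices and inverted exactly), so the quantizer sees a better-conditioned distribution without changing the network's function.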