KurTail : Kurtosis-based LLM Quantization

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
LLMs suffer from severe accuracy degradation under low-bit quantization due to outlier activations. This paper proposes Kurtosis-driven post-training quantization, introducing inter-layer rotation to reshape activation distributions—specifically mitigating heavy-tailed outliers—and achieving the first fully symmetric 4-bit quantization of weights, activations, and KV caches. Crucially, it formulates kurtosis minimization as the objective for learning layer-wise orthogonal rotation matrices, effectively suppressing outliers while preserving representational fidelity. Rotation learning for Llama3-70B requires only a single H100 GPU, drastically lowering hardware requirements. Experiments demonstrate substantial improvements: +13.3% MMLU accuracy and −15.5% Wiki perplexity over QuaRot; +2.6% MMLU and −2.9% perplexity over SpinQuant, with significantly reduced computational overhead.

📝 Abstract
One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages Kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes Kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
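The abstract's core intuition, that an orthogonal rotation spreads a heavy-tailed outlier channel across many channels and thereby lowers kurtosis, can be illustrated with a toy sketch. This is not the paper's implementation: the fixed Hadamard rotation and the planted outlier channel are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(x):
    """Fisher (excess) kurtosis: 0 for a Gaussian, large for heavy tails."""
    x = x - x.mean()
    return float(np.mean(x**4) / np.mean(x**2) ** 2 - 3.0)

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
acts = rng.standard_normal((4096, 128))
acts[:, 0] *= 50.0  # plant one high-magnitude "outlier" channel

rotated = acts @ hadamard(128)  # orthogonal rotation, so no information is lost

print(excess_kurtosis(acts))     # large positive: heavy-tailed
print(excess_kurtosis(rotated))  # near 0: outlier energy spread across channels
```

Because the rotation is orthogonal it can be folded into adjacent weight matrices at no inference cost, which is why rotation-based PTQ schemes target it.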
Problem

Research questions and friction points this paper is trying to address.

Outlier activations make uniform quantization schemes ineffective at extreme bit-widths such as 4-bit.
Quantizing weights, activations, and the KV cache jointly to 4 bits without severe accuracy loss.
Existing rotation-learning methods such as SpinQuant require multiple high-end GPUs, limiting accessibility.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kurtosis-based rotation mitigates outliers in LLM activations
Layer-wise optimization keeps memory requirements low
Rotation learning runs on a single GPU, sharply reducing training cost
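The kurtosis-minimization objective for learning the rotation can likewise be sketched in miniature. Below, a single 2-D Givens rotation angle is chosen by grid search over the kurtosis of the rotated activations; this is a stand-in for the paper's gradient-based learning of layer-wise orthogonal matrices, and the Laplace outlier channel and grid search are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(x):
    """Fisher (excess) kurtosis of all entries, 0 for a Gaussian."""
    x = x - x.mean()
    return float(np.mean(x**4) / np.mean(x**2) ** 2 - 3.0)

def givens(theta):
    """2-D orthogonal (Givens) rotation by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(1)
acts = np.empty((8192, 2))
acts[:, 0] = rng.standard_normal(8192)      # well-behaved channel
acts[:, 1] = 10.0 * rng.laplace(size=8192)  # heavy-tailed outlier channel

# Treat the kurtosis of the rotated activations as the loss and search theta.
thetas = np.linspace(0.0, np.pi / 2, 181)
losses = [excess_kurtosis(acts @ givens(t)) for t in thetas]
best = thetas[int(np.argmin(losses))]  # near pi/4: outlier energy split evenly
```

In the paper this objective is optimized per layer with gradient descent over orthogonal matrices, which is what keeps memory usage within a single GPU's budget.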
Mohammad Sadegh Akhondzadeh
University of Cologne
Aleksandar Bojchevski
University of Cologne
E. Eleftheriou
Axelera AI
M. Dazzi
Axelera AI