SpinQuant: LLM quantization with learned rotations

📅 2024-05-26
🏛️ arXiv.org
📈 Citations: 39
Influential: 10
📄 PDF
🤖 AI Summary
This work addresses the severe accuracy degradation in post-training quantization (PTQ) of large language models (LLMs) caused by outliers in weights, activations, and KV caches. The authors identify a family of rotation parameterizations that leave full-precision Transformer outputs unchanged while suppressing outliers in all three tensor types, observe that random rotations vary widely in the quantization quality they yield, and therefore train the rotation matrices end to end under an orthogonality constraint. The method is fully compatible with 4-bit quantization of weights, activations, and the KV cache and requires no changes to the Transformer architecture. On LLaMA-2 7B, 4-bit PTQ achieves zero-shot accuracy only 2.9 points below full precision, surpassing LLM-QAT by 19.1 points, SmoothQuant by 25.0 points, and the concurrent QuaRot; on the hard-to-quantize LLaMA-3 8B, it reduces the accuracy gap to full precision by up to 45.1% relative to QuaRot.

📝 Abstract
Post-training quantization (PTQ) techniques applied to weights, activations, and the KV cache greatly reduce memory usage, latency, and power consumption of Large Language Models (LLMs), but may lead to large quantization errors when outliers are present. Rotating activation or weight matrices helps remove outliers and benefits quantization. In this work, we identify a collection of applicable rotation parameterizations that lead to identical outputs in full-precision Transformer architectures while enhancing quantization accuracy. In addition, we find that some random rotations lead to much better quantization than others, with an up to 13 points difference in downstream zero-shot reasoning performance. As a result, we propose SpinQuant, a novel approach that incorporates learned rotation matrices for optimal quantized network accuracy. With 4-bit quantization of weight, activation, and KV-cache, SpinQuant narrows the accuracy gap on zero-shot reasoning tasks with full precision to merely 2.9 points on the LLaMA-2 7B model, surpassing LLM-QAT by 19.1 points and SmoothQuant by 25.0 points. Furthermore, SpinQuant also outperforms concurrent work QuaRot, which applies random rotations to remove outliers. In particular, for LLaMA-3 8B models that are hard to quantize, SpinQuant reduces the gap to full precision by up to 45.1% relative to QuaRot. Code is available at https://github.com/facebookresearch/SpinQuant.
Problem

Research questions and friction points this paper is trying to address.

Outliers in weights, activations, and the KV cache cause large errors under post-training quantization.
Quantization quality depends heavily on which rotation is applied; random rotations vary by up to 13 points in zero-shot accuracy.
4-bit quantized LLMs lag full precision substantially on zero-shot reasoning tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end learned orthogonal rotation matrices suppress outliers before quantization.
Identified rotation parameterizations preserve full-precision Transformer outputs (computational invariance).
With 4-bit weight, activation, and KV-cache quantization, narrows the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
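Learning a rotation with gradient descent requires keeping the matrix orthogonal throughout training. One standard way to do this, which Cayley-style optimizers on the orthogonal group build on, is to parameterize the rotation through an unconstrained skew-symmetric matrix; the sketch below (illustrative, not the paper's implementation) verifies that the resulting matrix is orthogonal by construction and preserves layer outputs.

```python
import numpy as np

def cayley(A):
    """Cayley transform: maps a skew-symmetric A to an orthogonal R.

    I + A is always invertible for skew-symmetric A, since the
    eigenvalues of A are purely imaginary.
    """
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)

rng = np.random.default_rng(0)
P = rng.normal(size=(16, 16))
A = P - P.T          # skew-symmetric: A.T == -A
R = cayley(A)

# Orthogonality holds for any A, so gradient steps on the free
# parameter A always yield a valid rotation without re-projection.
assert np.allclose(R.T @ R, np.eye(16), atol=1e-8)

# Rotating weights and inputs oppositely preserves the output,
# which is why the rotation can be trained without changing the
# full-precision model's behavior.
W = rng.normal(size=(16, 16))
x = rng.normal(size=16)
assert np.allclose((W @ R) @ (R.T @ x), W @ x)
```

In this parameterization the quantized network's task loss can be minimized over A directly (with a straight-through estimator for the rounding step), which is the sense in which the rotations are "learned" rather than sampled.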