Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of preserving large language model (LLM) performance under extreme quantization—specifically W2A4KV4 (2-bit weights, 4-bit activations, 4-bit key-value caches). We propose a synergistic framework integrating rotation-based preprocessing, a learnable non-uniform weight quantizer, and quantization-aware training (QAT), augmented by a novel Learnable Direct Partitioning (LDP) mechanism that enables end-to-end joint optimization of non-uniform quantization intervals. Notably, we are the first to rigorously incorporate random rotation theory into 2-bit weight quantization design. To enable efficient inference, we develop a custom GPU GEMV kernel optimized for W2A4 computation. On LLaMA-2-7B, our method incurs only a +2.84 increase in WikiText-2 perplexity while reducing memory footprint by 5.29×. It robustly supports LLaMA-3.2, WizardCoder-7B, and MetaMath-7B without convergence failure or repetitive generation.

📝 Abstract
We propose Rotate, Clip, and Partition (RCP), a quantization-aware training (QAT) approach that first realizes extreme compression of LLMs with the W2A4KV4 (2-bit weight, 4-bit activation, and 4-bit KV cache) configuration. RCP integrates recent rotation techniques with a novel non-uniform weight quantizer design, informed by a quantitative analysis of the impact of random rotation on 2-bit weight quantization. Our weight quantizer features Learnable Direct Partitioning (LDP), which introduces learnable parameters to learn non-uniform intervals directly, jointly with the LLM weights. We also present a specialized GPU kernel that supports GEMV on non-uniform W2A4. Experiments show that RCP can compress LLaMA-2-7B to W2A4KV4 with a loss of only 2.84 WikiText2 perplexity and a 5.29× reduced memory footprint. Furthermore, RCP can quantize the challenging mobile-targeted LLaMA-3.2 models and the domain-specific WizardCoder-7B and MetaMath-7B without critical problems such as convergence failure or repetition. Code will be made available at blind_review.
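The core of LDP, as the abstract describes it, is a non-uniform quantizer whose decision boundaries are trainable parameters rather than fixed grid points. A minimal forward-pass sketch is below; the names `levels` and `partitions` are illustrative (not from the paper), and the gradient path through the discrete assignment (e.g. a straight-through estimator) is omitted:

```python
import numpy as np

def ldp_quantize(w, levels, partitions):
    """Snap each weight to one of len(levels) non-uniform levels.

    `levels` holds the sorted quantization outputs; `partitions` holds the
    sorted decision boundaries between adjacent levels. In LDP both would
    be learnable and optimized jointly with the model weights.
    """
    idx = np.searchsorted(partitions, w)  # bucket index for each weight
    return levels[idx]

# 2-bit example: 4 output levels separated by 3 learnable boundaries.
levels = np.array([-1.0, -0.3, 0.2, 0.9])
partitions = np.array([-0.65, -0.05, 0.55])

w = np.array([-2.0, -0.5, 0.0, 0.4, 1.5])
q = ldp_quantize(w, levels, partitions)
# each weight snaps to its bucket's level: [-1.0, -0.3, 0.2, 0.2, 0.9]
```

Because the boundaries are free parameters, the quantizer can concentrate its four levels where the (rotated) weight distribution has mass, rather than spacing them uniformly.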
Problem

Research questions and friction points this paper is trying to address.

Extreme compression of large language models
Integration of rotation techniques with quantization
Development of a GPU kernel for non-uniform quantization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates rotation and non-uniform quantizer
Features Learnable Direct Partitioning (LDP)
Specialized GPU kernel for GEMV support
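The rotation ingredient listed above rests on a standard identity: multiplying the weights by an orthogonal matrix and counter-rotating the activations leaves the GEMV result unchanged, while spreading out weight outliers before quantization. A small sketch, using a random orthogonal matrix from a QR decomposition as a stand-in for whatever rotation the paper employs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random orthogonal rotation via QR decomposition (a generic choice;
# the paper analyzes random rotations for 2-bit weight quantization).
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

W = rng.standard_normal((4, d))  # toy weight matrix
x = rng.standard_normal(d)       # toy activation vector

# (W Q)(Q^T x) == W x, so rotated weights can be quantized while the
# end-to-end computation is preserved.
y_ref = W @ x
y_rot = (W @ Q) @ (Q.T @ x)
assert np.allclose(y_ref, y_rot)
```

In practice the rotated weights `W @ Q` are what gets quantized to 2 bits, which is where the interaction with the non-uniform quantizer comes in.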
Authors
Euntae Choi, Seoul National University
Sumin Song, Seoul National University
Woosang Lim, Seoul National University
Sungjoo Yoo, Seoul National University
memory, storage