🤖 AI Summary
To address the long-standing trade-off between model size and accuracy in large language model (LLM) quantization, this paper proposes GPTVQ, a fast post-training vector quantization (VQ) method. Methodologically, GPTVQ combines four key ingredients: (i) theoretical and empirical evidence that increasing the quantization dimensionality significantly improves the size–accuracy trade-off; (ii) a Hessian-weighted per-layer output-reconstruction objective; (iii) column-wise quantization interleaved with Hessian-informed updates to the remaining unquantized weights; and (iv) data-aware EM codebook initialization, followed by codebook compression via integer quantization and SVD-based low-rank factorization. Evaluated on Llama-2, Mistral, and other state-of-the-art LLMs, GPTVQ sets a new post-training quantization state of the art: it quantizes a 70B-parameter model on a single H100 GPU in just 3–11 hours, and achieves lower VQ decompression latency on a mobile CPU than standard 4-bit integer quantization, advancing both accuracy retention and deployment efficiency.
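To make the core idea concrete, here is a minimal sketch of d-dimensional vector quantization with a codebook fitted by k-means, the hard-assignment special case of the EM algorithm the summary mentions. This is an illustrative simplification, not GPTVQ's actual data-aware initialization (which additionally weights the fit by Hessian information); all names and sizes are hypothetical.

```python
import numpy as np

def kmeans_codebook(vectors, k, iters=10, seed=0):
    """Fit a k-entry codebook to d-dimensional vectors with k-means
    (hard-assignment EM). Sketch only; not GPTVQ's exact initialization."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)]
    assign = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # E-step: assign each vector to its nearest codebook entry
        d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # M-step: move each entry to the mean of its assigned vectors
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                codebook[j] = members.mean(0)
    return codebook, assign

# 2-D VQ of a weight matrix with a 256-entry codebook:
# one 8-bit index per 2 weights = 4 bits/weight, plus codebook overhead.
W = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
vecs = W.reshape(-1, 2)               # group weights into 2-D vectors
codebook, idx = kmeans_codebook(vecs, 256)
W_q = codebook[idx].reshape(W.shape)  # decompression is a pure table lookup
```

Note why decompression can be fast on device: reconstructing weights is just an index lookup into a small table, which is the property behind the mobile-CPU latency results.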
📝 Abstract
In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a fast new method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2 70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
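The interleaved column-wise scheme the abstract describes can be sketched as follows. This is a simplified illustration in the style of GPTQ/OBQ-type solvers: quantize one column at a time, then push the resulting error onto the still-unquantized columns via the (Cholesky-factored) inverse Hessian of the per-layer output-reconstruction MSE. For clarity this rounds each weight to a scalar grid; GPTVQ instead assigns groups of weights to VQ codebook entries. All variable names and the damping constant are assumptions, not the paper's exact implementation.

```python
import numpy as np

def quantize_columns(W, H, grid):
    """Quantize W column by column, compensating the not-yet-quantized
    columns using the inverse Hessian. Simplified scalar-grid sketch."""
    W = W.astype(np.float64).copy()
    n = W.shape[1]
    Hd = H + 1e-2 * np.trace(H) / n * np.eye(n)    # damping for stability
    Hinv = np.linalg.cholesky(np.linalg.inv(Hd)).T  # upper-triangular factor
    for c in range(n):
        w = W[:, c]
        # round to the nearest grid level (GPTVQ: nearest codebook vector)
        q = grid[np.abs(w[:, None] - grid[None, :]).argmin(1)]
        W[:, c] = q
        err = (w - q) / Hinv[c, c]
        # spread the quantization error onto the remaining columns
        W[:, c + 1:] -= np.outer(err, Hinv[c, c + 1:])
    return W

# Hypothetical layer: H proportional to X^T X over calibration inputs X.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))     # calibration activations (assumed)
H = X.T @ X
W = rng.normal(size=(16, 32))
grid = np.linspace(-2, 2, 16)      # a 4-bit uniform grid, for illustration
W_q = quantize_columns(W, H, grid)
```

The design point worth noting is that the compensation step minimizes the layer's output error, not the raw weight error, which is why the Hessian of the output-reconstruction MSE (rather than simple rounding) drives the updates.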