VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high latency and inefficient memory access in vector-quantized (VQ) large language model (LLM) inference, this paper proposes an efficient VQ kernel generation framework. It introduces a novel codebook caching abstraction that adaptively distributes codebook entries across the full GPU memory hierarchy; designs a codebook-centric dataflow with operator fusion to jointly optimize computation and memory access; and incorporates a heuristic parameter search strategy tailored to diverse VQ configurations. Experiments show that the approach reduces average inference latency by 46.13% over unoptimized kernels and by 64.36%–99.1% over mainstream open-source implementations. At equivalent bit-widths, it matches or outperforms state-of-the-art element-wise quantization methods, including AWQ and KVQuant, while improving both the inference efficiency and hardware adaptability of VQ-augmented LLMs.
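To make the setting concrete: a VQ-compressed weight matrix stores small integer codes, each selecting one vector from a shared codebook, so dequantization is a gather operation. The sketch below is illustrative only; all shapes and names are hypothetical and not VQ-LLM's actual API.

```python
import random

# Hypothetical VQ layout: each row of `codes` holds one index per
# vec_dim-element group of the original weight row.
random.seed(0)

vec_dim = 4          # elements per codebook entry
num_entries = 256    # 8-bit codes
rows, groups = 8, 4  # each row reconstructs to groups * vec_dim weights

# Codebook: num_entries vectors of length vec_dim.
codebook = [[random.random() for _ in range(vec_dim)]
            for _ in range(num_entries)]
# Compressed weights: one codebook index per group.
codes = [[random.randrange(num_entries) for _ in range(groups)]
         for _ in range(rows)]

def dequantize(codebook, codes):
    """Reconstruct each weight row by gathering codebook vectors."""
    return [[v for idx in row for v in codebook[idx]] for row in codes]

weights = dequantize(codebook, codes)
```

Every weight access thus becomes an indirect codebook lookup, which is why codebook placement in the GPU memory hierarchy dominates the kernel's memory traffic.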

📝 Abstract
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the integration of VQ with various computations. The codebook cache adaptively stores different entries across the GPU's memory hierarchy, including off-chip global memory, on-chip shared memory, and registers. Centered around the codebook cache, we design an efficient computation engine that optimizes memory traffic during computations involving codebooks. This compute engine adopts the codebook-centric dataflow and fusion optimizations. Additionally, we provide adaptive heuristics to tailor parameter selection in our optimizations to diverse VQ configurations. Our optimizations achieve an average latency reduction of 46.13% compared to unoptimized versions. Compared to existing open-source implementations, our methods decrease latency by 64.36% to 99.1%. A final comparison with state-of-the-art element-wise quantization methods like AWQ and KVQuant shows that our VQ-LLM is practically viable, achieving latencies close to, or even better than, those methods at equivalent bit-widths, while potentially offering greater accuracy.
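The codebook-cache idea of adaptively placing entries across registers, shared memory, and global memory can be sketched as a frequency-based tier assignment. This is a simplified illustration under assumed inputs, not VQ-LLM's real placement algorithm; tier names and budgets are hypothetical.

```python
# Sketch: assign codebook entries to GPU memory tiers by estimated
# access frequency, filling the fastest tier first under a per-tier
# capacity budget. Overflow entries fall back to global memory.
def assign_tiers(access_counts, capacities):
    """access_counts: per-entry hit counts.
    capacities: dict tier -> max entries, ordered fastest to slowest."""
    order = sorted(range(len(access_counts)),
                   key=lambda i: access_counts[i], reverse=True)
    placement = {}
    it = iter(order)
    for tier, cap in capacities.items():
        for _ in range(cap):
            try:
                placement[next(it)] = tier
            except StopIteration:
                return placement
    for i in it:  # everything that did not fit stays off-chip
        placement[i] = "global"
    return placement

tiers = {"registers": 2, "shared": 4}        # hypothetical budgets
counts = [50, 3, 40, 9, 1, 30, 8, 2]         # hypothetical profile
placement = assign_tiers(counts, tiers)
```

The design intuition is that hot codebook entries amortize scarce fast storage, while cold entries tolerate global-memory latency.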
Problem

Research questions and friction points this paper is trying to address.

Codebook lookups in VQ-based LLM inference cause irregular, inefficient memory access on GPUs.
Existing open-source VQ implementations incur high inference latency compared to element-wise quantization.
VQ kernels must adapt to diverse configurations and the GPU memory hierarchy to remain practical.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Codebook cache optimizes VQ memory access.
Codebook-centric dataflow enhances computation efficiency.
Adaptive heuristics tailor parameter selection to diverse VQ configurations.
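The codebook-centric dataflow and fusion bullets above can be sketched as a fused dequantize-plus-matvec: instead of materializing the full weight matrix and then multiplying, each codebook vector is looked up and consumed in place. This is a CPU-side illustration of the fusion idea, not the paper's actual GPU kernel; all names are hypothetical.

```python
# Sketch of codebook-centric fusion: compute y = W @ x directly from
# codes, so the dequantized W is never written to memory.
def fused_vq_matvec(codebook, codes, x):
    """codebook: list of d-length vectors; codes: rows x groups of
    indices; x: input vector of length groups * d."""
    d = len(codebook[0])
    out = []
    for row in codes:
        acc = 0.0
        for g, idx in enumerate(row):
            entry = codebook[idx]           # single codebook lookup
            seg = x[g * d:(g + 1) * d]
            acc += sum(e * s for e, s in zip(entry, seg))
        out.append(acc)
    return out

# Tiny deterministic demo: 2-entry codebook of 2-element vectors.
codebook = [[1.0, 0.0], [0.0, 1.0]]
codes = [[0, 1], [1, 0]]
y = fused_vq_matvec(codebook, codes, [1.0, 2.0, 3.0, 4.0])
```

Compared with dequantize-then-multiply, fusion removes the intermediate weight tensor entirely, which is the memory-traffic saving the fused kernels target.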