🤖 AI Summary
Problem: Codebook-quantized LLMs suffer from high latency and cache pressure at ultra-low bit-widths (e.g., 2-bit) due to frequent dequantization during GEMM.
Method: This paper proposes a codebook-centric GEMM kernel that eliminates per-element dequantization. Instead, it precomputes all possible inner products between activations and codebook centroids, organizing them into an efficient Psumbook lookup table; partial sums are then aggregated directly via indexing.
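A minimal NumPy sketch of the idea (illustrative shapes and names, not the paper's kernel): for one activation vector, all activation-centroid inner products are computed once into a Psumbook-style table, and each output element is then formed by gathering partial sums via its weight code indices. The group length `g`, codebook size `K`, and layer sizes below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

g, K = 8, 256                     # centroid length and codebook size (assumed)
in_features, out_features = 64, 16
num_groups = in_features // g

codebook = rng.standard_normal((K, g)).astype(np.float32)    # shared centroids
codes = rng.integers(0, K, size=(out_features, num_groups))  # per-group code indices
x = rng.standard_normal(in_features).astype(np.float32)      # activation vector

# Psumbook: one inner product per (input group, centroid), computed once.
# P[j, k] = <x[j*g:(j+1)*g], codebook[k]>
P = x.reshape(num_groups, g) @ codebook.T            # (num_groups, K)

# GEMM by gathering: each output row accumulates its groups' partial sums.
y = P[np.arange(num_groups), codes].sum(axis=1)      # (out_features,)

# Reference path: reconstruct (dequantize) the weights, then do a dense matvec.
W = codebook[codes].reshape(out_features, in_features)
assert np.allclose(y, W @ x, atol=1e-4)
```

The gather path touches only `num_groups * K` precomputed partial sums per activation vector instead of fetching centroids and rebuilding every weight element, which is the dequantization work the method removes.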
Contribution/Results: The approach enables the first codebook-driven, end-to-end dequantization-free GEMM, supporting joint optimization of latency, memory footprint, and accuracy. Evaluated on Llama-3, it achieves 1.83× (8B) and 8.93× (70B) speedups over state-of-the-art codebook methods at 2-bit quantization with comparable accuracy, while significantly improving computational efficiency and memory subsystem utilization.
📝 Abstract
Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83× (8B) and 8.93× (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy, and further improves computing efficiency and memory subsystem utilization.
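As a rough illustration of one axis of that trade-off space (assumed shapes and formats, not figures from the paper): at a fixed 2-bit weight budget, log2(K)/g = 2 bits per weight, so the choice of group length g and codebook size K shifts memory between the codes and the per-row Psumbook.

```python
# Back-of-envelope sketch (assumptions only): enumerate (g, K) pairs that keep
# 2 bits per weight and report the Psumbook footprint per activation row,
# i.e. num_groups * K fp16 partial sums for an assumed hidden size.
in_features = 4096                 # assumed hidden size
for g in (2, 4, 8):
    K = 2 ** (2 * g)               # codebook size that keeps 2 bits per weight
    num_groups = in_features // g
    psumbook_fp16_bytes = num_groups * K * 2   # per activation row
    print(f"g={g:>2}  K={K:>6}  Psumbook per row = {psumbook_fp16_bytes / 1024:.0f} KiB")
```

Longer centroids shrink the code stream and the per-group gather count but grow the Psumbook that must stay on-chip, which is the kind of latency-memory-accuracy knob a unified kernel implementation can sweep.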