🤖 AI Summary
Problem: Codebook-quantized LLMs suffer from high latency and cache pressure at ultra-low bit-widths (e.g., 2-bit) due to frequent dequantization during GEMM.
Method: This paper proposes a codebook-centric GEMM kernel that eliminates per-element dequantization. Instead, it precomputes all possible inner products between activations and codebook centroids, organizing them into an efficient Psumbook lookup table; partial sums are then aggregated directly via indexing.
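A minimal NumPy sketch of the idea (illustrative shapes and names, not the paper's kernel): for one activation vector, all activation-centroid inner products are computed once into a Psumbook-style table, and each output element is then formed by gathering partial sums via its weight code indices. The group length `g`, codebook size `K`, and layer sizes below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

g, K = 8, 256                     # centroid length and codebook size (assumed)
in_features, out_features = 64, 16
num_groups = in_features // g

codebook = rng.standard_normal((K, g)).astype(np.float32)    # shared centroids
codes = rng.integers(0, K, size=(out_features, num_groups))  # per-group code indices
x = rng.standard_normal(in_features).astype(np.float32)      # activation vector

# Psumbook: one inner product per (input group, centroid), computed once.
# P[j, k] = <x[j*g:(j+1)*g], codebook[k]>
P = x.reshape(num_groups, g) @ codebook.T            # (num_groups, K)

# GEMM by gathering: each output row accumulates its groups' partial sums.
y = P[np.arange(num_groups), codes].sum(axis=1)      # (out_features,)

# Reference path: reconstruct (dequantize) the weights, then do a dense matvec.
W = codebook[codes].reshape(out_features, in_features)
assert np.allclose(y, W @ x, atol=1e-4)
```

The gather path touches only `num_groups * K` precomputed partial sums per activation vector instead of fetching centroids and rebuilding every weight element, which is the dequantization work the method removes.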
Contribution/Results: The approach enables the first codebook-driven, end-to-end dequantization-free GEMM, supporting joint optimization of latency, memory footprint, and accuracy. Evaluated on Llama-3, it achieves 1.83× (8B) and 8.93× (70B) speedups over state-of-the-art codebook methods at 2-bit quantization with comparable accuracy, while significantly improving computational efficiency and memory subsystem utilization.
📝 Abstract
Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83× (8B) and 8.93× (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy, and further improves computing efficiency and memory subsystem utilization.
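As a rough illustration of one axis of that trade-off space (assumed shapes and formats, not figures from the paper): at a fixed 2-bit weight budget, log2(K)/g = 2 bits per weight, so the choice of group length g and codebook size K shifts memory between the codes and the per-row Psumbook.

```python
# Back-of-envelope sketch (assumptions only): enumerate (g, K) pairs that keep
# 2 bits per weight and report the Psumbook footprint per activation row,
# i.e. num_groups * K fp16 partial sums for an assumed hidden size.
in_features = 4096                 # assumed hidden size
for g in (2, 4, 8):
    K = 2 ** (2 * g)               # codebook size that keeps 2 bits per weight
    num_groups = in_features // g
    psumbook_fp16_bytes = num_groups * K * 2   # per activation row
    print(f"g={g:>2}  K={K:>6}  Psumbook per row = {psumbook_fp16_bytes / 1024:.0f} KiB")
```

Longer centroids shrink the code stream and the per-group gather count but grow the Psumbook that must stay on-chip, which is the kind of latency-memory-accuracy knob a unified kernel implementation can sweep.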