CodeGEMM: A Codebook-Centric Approach to Efficient GEMM in Quantized LLMs

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Codebook-quantized LLMs suffer from high latency and cache pressure at ultra-low bit-widths (e.g., 2-bit) because GEMM kernels must repeatedly dequantize weights. Method: This paper proposes a codebook-centric GEMM kernel that eliminates per-element dequantization. Instead, it precomputes all possible inner products between activations and codebook centroids, organizes them into an efficient "Psumbook" lookup table, and aggregates partial sums directly via code indices. Contribution/Results: The approach enables the first codebook-driven, end-to-end dequantization-free GEMM, supporting joint optimization of latency, memory footprint, and accuracy. Evaluated on Llama-3, it achieves 1.83× (8B) and 8.93× (70B) speedups over state-of-the-art codebook methods at 2-bit quantization with comparable accuracy, while improving computational efficiency and memory-subsystem utilization.

📝 Abstract
Weight-only quantization is widely used to mitigate the memory-bound nature of LLM inference. Codebook-based methods extend this trend by achieving strong accuracy in the extremely low-bit regime (e.g., 2-bit). However, current kernels rely on dequantization, which repeatedly fetches centroids and reconstructs weights, incurring substantial latency and cache pressure. We present CodeGEMM, a codebook-centric GEMM kernel that replaces dequantization with precomputed inner products between centroids and activations stored in a lightweight Psumbook. At inference, code indices directly gather these partial sums, eliminating per-element lookups and reducing the on-chip footprint. The kernel supports the systematic exploration of latency-memory-accuracy trade-offs under a unified implementation. On Llama-3 models, CodeGEMM delivers 1.83x (8B) and 8.93x (70B) speedups in the 2-bit configuration compared to state-of-the-art codebook-based quantization at comparable accuracy and further improves computing efficiency and memory subsystem utilization.
Problem

Research questions and friction points this paper is trying to address.

Improves efficiency of quantized LLMs by eliminating dequantization overhead.
Reduces latency and cache pressure in codebook-based weight quantization.
Enables systematic exploration of latency-memory-accuracy trade-offs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Precomputes inner products between centroids and activations
Uses code indices to gather partial sums directly
Eliminates per-element lookups, reducing the on-chip footprint
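The idea above can be sketched numerically: instead of reconstructing each weight from its centroid before multiplying, precompute the inner product of every centroid with every activation group once (the Psumbook), then assemble each output as a sum of table entries gathered by code index. The sketch below is a minimal NumPy illustration under assumed sizes (K centroids, group size g, and the names `psumbook`, `codes` are ours, not the paper's kernel API); the real kernel operates on packed indices in GPU memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper: K centroids over
# g-element weight groups (log2(16)/4 = 1 bit per weight here).
K, g = 16, 4
m, n = 8, 32          # output rows, input features; n divisible by g
G = n // g            # number of groups per row

codebook = rng.standard_normal((K, g))    # centroid table
codes = rng.integers(0, K, size=(m, G))   # per-group code indices
x = rng.standard_normal(n)                # activation vector

# Baseline: what a dequantization-based kernel does, i.e. reconstruct
# the dense weight matrix from centroids and run a normal GEMV.
W = codebook[codes].reshape(m, n)
y_ref = W @ x

# Psumbook path: one centroid-times-activation-group product per
# (centroid, group) pair, computed once per activation vector.
x_groups = x.reshape(G, g)
psumbook = codebook @ x_groups.T          # shape (K, G)

# GEMV becomes a gather of partial sums by code index, no dequantization.
y = psumbook[codes, np.arange(G)].sum(axis=1)

assert np.allclose(y, y_ref)
```

The gather step is why the on-chip footprint shrinks: the kernel only touches the (K x G) Psumbook and the integer codes, never the reconstructed weights.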
🔎 Similar Papers
2024-07-15 · Conference on Empirical Methods in Natural Language Processing · Citations: 8
Gunho Park, Jeongin Bae, Byeongwook Kim, Baeseong Park, Jiwon Ryu, Hoseung Kim, Se Jung Kwon, Dongsoo Lee (NAVER Cloud)