KLLM: Fast LLM Inference with K-Means Quantization

📅 2025-07-30
🤖 AI Summary
Low-precision inference of large language models (LLMs) faces two key bottlenecks: non-uniform quantization cannot run directly on low-precision compute units, and online activation outlier detection adds substantial runtime overhead. This paper proposes KLLM, a hardware-software co-design framework. First, it employs K-Means-based non-uniform quantization for weights and activations, preserving model accuracy at 2–4 bits. Second, it introduces an index-based computation scheme for MatMuls and nonlinear operations that avoids most dequantization and full-precision arithmetic. Third, it designs Orizuru, a lightweight engine that identifies the top-k largest and smallest activation elements online. Compared with the A100 GPU and Atom baselines, KLLM achieves average speedups of 9.67× and 7.03× and energy-efficiency improvements of 229.50× and 150.21×, respectively.

πŸ“ Abstract
Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. However, two key challenges remain in existing WAQ designs. (1) Traditional WAQ designs rely on uniform integer-based quantization for hardware efficiency, but this often results in significant accuracy degradation at low precision. K-Means-based quantization, a non-uniform quantization technique, achieves higher accuracy by matching the Gaussian-like distributions of weights and activations in LLMs. However, its non-uniform nature prevents direct execution on low-precision compute units, requiring dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers further hinder effective low-precision WAQ. Offline thresholding methods for outlier detection can lead to significant model performance degradation, while existing online detection techniques introduce substantial runtime overhead. To address these challenges and fully unleash the potential of WAQ with K-Means quantization for LLM inference, we propose KLLM, a hardware-software co-design framework. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a novel outlier detection engine, Orizuru, that efficiently identifies the top-$k$ largest and smallest elements in the activation data stream during online inference. Extensive experiments show that, on average, KLLM achieves speedups of 9.67x and 7.03x and energy efficiency improvements of 229.50x and 150.21x compared to the A100 GPU and Atom, respectively.
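To make the K-Means quantization idea in the abstract concrete, here is a minimal sketch of scalar (1-D) K-Means quantization using Lloyd's algorithm. It is an illustration only, not the paper's implementation: the function names `assign` and `kmeans_1d`, the random initialisation, the iteration count, and the choice of 16 centroids (4-bit indices) are all assumptions made for this example.

```python
import random


def assign(values, codebook):
    """Map every value to the index of its nearest centroid."""
    return [min(range(len(codebook)), key=lambda c: abs(v - codebook[c]))
            for v in values]


def kmeans_1d(values, k, iters=20, seed=0):
    """Lloyd's algorithm on scalars; returns (codebook, indices).

    Assumes `values` contains at least k distinct numbers.
    """
    rng = random.Random(seed)
    # Hypothetical initialisation: pick k distinct data points as centroids.
    codebook = rng.sample(sorted(set(values)), k)
    for _ in range(iters):
        indices = assign(values, codebook)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, i in zip(values, indices) if i == c]
            if members:
                codebook[c] = sum(members) / len(members)
    # Final assignment so the indices match the returned codebook.
    return codebook, assign(values, codebook)


# Gaussian-like toy "weights", quantized to 16 centroids (4-bit indices).
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]
codebook, idx = kmeans_1d(weights, k=16)

# Dequantization is a table lookup: index -> centroid value.
dequant = [codebook[i] for i in idx]
mse = sum((w - d) ** 2 for w, d in zip(weights, dequant)) / len(weights)
```

After quantization, only the small codebook and the per-element indices need to be stored. Because the centroids follow the data distribution, they cluster where values are dense, which is the property the abstract credits for the accuracy advantage over uniform integer grids.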
Problem

Research questions and friction points this paper is trying to address.

Reducing memory and computation demands in LLM inference
Improving accuracy of low-precision quantization in LLMs
Efficiently handling activation outliers during online inference
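As a rough software analogue of online outlier handling, the sketch below keeps the k largest and k smallest values of an activation stream in a single pass using two bounded heaps. It is an assumption-laden illustration, not a description of the Orizuru hardware engine: the class name `TopKTracker` and its methods are hypothetical.

```python
import heapq


class TopKTracker:
    """Tracks the k largest and k smallest values of a stream in one pass."""

    def __init__(self, k):
        self.k = k
        self._largest = []   # min-heap of (value, position)
        self._smallest = []  # min-heap of (-value, position)

    def push(self, position, value):
        # The same bounded-heap update serves both ends of the distribution:
        # negating the value turns "k smallest" into "k largest".
        for heap, key in ((self._largest, value), (self._smallest, -value)):
            item = (key, position)
            if len(heap) < self.k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # New element beats the weakest kept element; swap it in.
                heapq.heapreplace(heap, item)

    def outlier_positions(self):
        """Positions of the current top-k largest and smallest elements."""
        return sorted({p for _, p in self._largest} |
                      {p for _, p in self._smallest})


stream = [0.1, -0.2, 9.5, 0.3, -7.8, 0.0, 0.2, 6.1]
tracker = TopKTracker(k=2)
for pos, val in enumerate(stream):
    tracker.push(pos, val)

positions = tracker.outlier_positions()  # [1, 2, 4, 7]
```

Each element costs O(log k) work and no threshold has to be fixed offline, which is the general appeal of online top-k selection over offline thresholding.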
Innovation

Methods, ideas, or system contributions that make the work stand out.

K-Means quantization for LLM inference
Index-based computation that avoids most dequantization
Online outlier detection with Orizuru engine
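One way to see why computing on indices can avoid per-element dequantization is the sketch below. When both operands are stored as codebook indices, a dot product can be evaluated by counting how often each (weight index, activation index) pair occurs and multiplying each count by a single centroid product. This is a generic illustration under assumed names (`index_dot`, `dequant_dot`) and toy codebooks; it is not claimed to be KLLM's actual dataflow.

```python
from collections import Counter


def index_dot(w_idx, a_idx, w_codebook, a_codebook):
    """Dot product computed from index-pair counts."""
    # Histogram of (weight index, activation index) pairs: integer work only.
    pair_counts = Counter(zip(w_idx, a_idx))
    # One centroid product per distinct pair; with small codebooks this
    # product table could be precomputed once and reused.
    return sum(n * w_codebook[p] * a_codebook[q]
               for (p, q), n in pair_counts.items())


def dequant_dot(w_idx, a_idx, w_codebook, a_codebook):
    """Reference: dequantize every element, then multiply-accumulate."""
    return sum(w_codebook[p] * a_codebook[q] for p, q in zip(w_idx, a_idx))


# Toy 2-bit codebooks (4 centroids each) and index vectors.
w_codebook = [-1.0, -0.25, 0.25, 1.0]
a_codebook = [-2.0, 0.0, 0.5, 2.0]
w_idx = [0, 3, 3, 1, 2, 0, 3, 1]
a_idx = [3, 3, 0, 2, 2, 3, 3, 1]

fast = index_dot(w_idx, a_idx, w_codebook, a_codebook)
ref = dequant_dot(w_idx, a_idx, w_codebook, a_codebook)
```

With 2-bit indices on both sides there are at most 16 distinct pairs, so the number of floating-point multiplications is bounded by the codebook sizes rather than by the vector length.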
Authors

Xueying Wu
Department of Electrical and Computer Engineering, Duke University
Baijun Zhou
Department of Electrical and Computer Engineering, Duke University
Zhihui Gao
Department of Electrical and Computer Engineering, Duke University
Yuzhe Fu
Duke University
Algorithm-hardware co-design
Qilin Zheng
Duke University
Emerging Memory, In-Memory-Computing
Yintao He
Department of Electrical and Computer Engineering, Duke University
Hai Li
Department of Electrical and Computer Engineering, Duke University