Fast NF4 Dequantization Kernels for Large Language Model Inference

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although NF4 quantization effectively reduces memory consumption in large language models, it incurs a significant performance bottleneck during inference on NVIDIA GPUs due to the costly dequantization back to FP16. This work proposes a lightweight shared memory optimization that utilizes only 64 bytes of shared memory per thread block, simplifies indexing logic, and efficiently leverages the GPU memory hierarchy to accelerate NF4 dequantization. The approach is fully compatible with the HuggingFace ecosystem, requires no modifications to existing frameworks, and offers plug-and-play deployment. Evaluated on Gemma-27B, Qwen3-32B, and Llama3.3-70B models, the proposed kernel achieves 2.0–2.2× speedup over BitsAndBytes, with end-to-end inference acceleration of up to 1.54×.
📝 Abstract
Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4$\times$ memory reduction, inference on current NVIDIA GPUs (e.g., Ampere A100) requires expensive dequantization back to FP16 format, creating a critical performance bottleneck. This paper presents a lightweight shared memory optimization that addresses this gap through principled memory hierarchy exploitation while maintaining full ecosystem compatibility. We compare our technique against the open-source BitsAndBytes implementation, achieving 2.0--2.2$\times$ kernel speedup across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and up to 1.54$\times$ end-to-end improvement by leveraging the 12--15$\times$ latency advantage of shared memory over global memory access. Our optimization reduces instruction counts through simplified indexing logic while using only 64 bytes of shared memory per thread block, demonstrating that lightweight optimizations can deliver substantial performance gains with minimal engineering effort. This work provides a plug-and-play solution for the HuggingFace ecosystem that democratizes access to advanced models on existing GPU infrastructure.
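The 64 bytes of shared memory mentioned in the abstract correspond to a 16-entry table of 4-byte floats: NF4 stores one of 16 quantile levels per weight, so dequantization is a table lookup followed by a scale. The CPU sketch below illustrates that lookup-table idea only; it is not the paper's kernel (on the GPU, the table would live in shared memory and the loop would be parallelized across threads). The function name `dequantize_nf4` is illustrative, the codebook values are the standard NF4 quantile levels from the QLoRA paper as used in bitsandbytes, and the nibble ordering within each packed byte is an assumption that varies between implementations.

```python
# The 16 NF4 quantile levels (QLoRA / bitsandbytes). On the GPU this
# 16 x 4-byte table is the 64-byte structure the paper keeps in shared
# memory, so each thread resolves its 4-bit index with a low-latency
# lookup instead of recomputing or reading from global memory.
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def dequantize_nf4(packed: bytes, absmax: float) -> list[float]:
    """Unpack two 4-bit NF4 indices per byte and rescale by the
    quantization block's absolute maximum (nibble order is illustrative)."""
    out = []
    for b in packed:
        hi, lo = b >> 4, b & 0x0F   # two 4-bit indices per byte
        out.append(NF4_CODEBOOK[hi] * absmax)
        out.append(NF4_CODEBOOK[lo] * absmax)
    return out
```

For example, the byte `0x7F` holds indices 7 and 15, which map to levels 0.0 and 1.0 before scaling, so `dequantize_nf4(bytes([0x7F]), 2.0)` yields `[0.0, 2.0]`.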
Problem

Research questions and friction points this paper is trying to address.

NF4 dequantization
large language model inference
GPU memory bottleneck
quantization
performance optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

NF4 dequantization
shared memory optimization
LLM inference
memory hierarchy
kernel acceleration
Authors
Xiangbo Qi, University of Southern California
Chaoyi Jiang, University of Southern California
Murali Annavaram, University of Southern California (Computer Systems)