AI Summary
To address the memory and computational bottlenecks that KV caching induces in autoregressive decoding of large language models (LLMs) for long-context inference, this work proposes a GPU-native low-bit KV cache acceleration framework. We introduce the first Tensor Core-centric BitFusion architecture, enabling efficient Tensor Core scheduling for dynamically growing KV caches; design a warp-level parallel decoding kernel coupled with a fine-grained asynchronous dequantization pipeline; and tightly integrate 2-/4-bit quantization, the BitFusion data layout, and CUDA kernel optimizations. Evaluated on RTX 4090, A100, and H100 GPUs, our method achieves up to 7.5×, 4.8×, and 8.9× speedup, respectively, over FP16 FlashDecoding-v2, and up to 4.3× speedup over QServe. For LLaMA-3.1-8B at 128K context length, single-batch decoding latency is reduced by 3×.
Abstract
The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs. However, despite these benefits, preliminary implementations of low-bit KV caches struggle to deliver the expected speedup due to quantization and dequantization overheads and low Tensor Core utilization. In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with a low-bit KV cache. Leveraging Tensor Cores efficiently for a low-bit KV cache is challenging because new KV entries are generated dynamically at each decoding step. BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data-layout compatibility and thereby enables high Tensor Core utilization. Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency. Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100 over FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x. On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios. The code is available at https://github.com/DD-DuDa/BitDecoding.
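To make the quantization/dequantization overhead concrete, below is a minimal pure-Python sketch of the kind of per-group asymmetric 4-bit KV quantization that such systems accelerate. This is an illustration only, not BitDecoding's implementation: the group size, rounding mode, and two-codes-per-byte packing layout here are assumptions, and the real work in BitDecoding is performing the unpack-and-dequantize step on-GPU, overlapped with Tensor Core matrix multiplies.

```python
# Illustrative per-group asymmetric 4-bit quantization of a KV-cache vector.
# All layout choices (GROUP_SIZE, packing order) are assumptions for the sketch.

GROUP_SIZE = 32  # assumed number of values sharing one (scale, zero-point) pair


def quantize_4bit(values):
    """Quantize floats to 4-bit codes; returns packed bytes and per-group params."""
    packed, params = [], []
    for g in range(0, len(values), GROUP_SIZE):
        group = values[g:g + GROUP_SIZE]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 4 bits -> 16 levels; avoid zero scale
        codes = [round((v - lo) / scale) for v in group]  # each in 0..15
        # pack two 4-bit codes into one byte (low nibble first)
        for i in range(0, len(codes), 2):
            high = codes[i + 1] if i + 1 < len(codes) else 0
            packed.append((high << 4) | codes[i])
        params.append((scale, lo))
    return packed, params


def dequantize_4bit(packed, params, n):
    """Reconstruct approximate floats from packed nibbles and group params."""
    out = []
    bytes_per_group = GROUP_SIZE // 2
    for gi, (scale, lo) in enumerate(params):
        chunk = packed[gi * bytes_per_group:(gi + 1) * bytes_per_group]
        for b in chunk:
            out.append((b & 0xF) * scale + lo)   # low nibble
            out.append((b >> 4) * scale + lo)    # high nibble
    return out[:n]
```

The round trip keeps each value within half a quantization step (scale / 2) of the original while storing 4 bits per element plus the small per-group metadata; doing this unpacking per decoding step on the CPU-style scalar path above is exactly the overhead that a fused, warp-parallel GPU kernel is designed to hide.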