AI Summary
To address the memory and computational bottlenecks that KV caching induces in autoregressive decoding of large language models (LLMs) for long-context inference, this work proposes a GPU-native low-bit KV cache acceleration framework. We introduce the first Tensor Core-centric BitFusion architecture, enabling efficient Tensor Core scheduling for dynamically growing KV caches; design a warp-level parallel decoding kernel coupled with a fine-grained asynchronous dequantization pipeline; and tightly integrate 2-/4-bit quantization, the BitFusion data layout, and CUDA kernel optimizations. Evaluated on RTX 4090, A100, and H100 GPUs, our method achieves up to 7.5×, 4.8×, and 8.9× speedup, respectively, over FP16 FlashDecoding-v2, and up to 4.3× speedup over QServe. For LLaMA-3.1-8B at 128K context length, single-batch decoding latency is reduced by 3×.
Abstract
The growing adoption of long-context Large Language Models (LLMs) has introduced significant memory and computational challenges in autoregressive decoding due to the expanding Key-Value (KV) cache. KV cache quantization has emerged as a promising solution, with prior work showing that 4-bit or even 2-bit quantization can maintain model accuracy while reducing memory costs. However, despite these benefits, preliminary implementations of low-bit KV caches struggle to deliver the expected speedup due to quantization and dequantization overheads and low Tensor Core utilization. In this work, we propose BitDecoding, a GPU-optimized framework that unlocks Tensor Cores for efficient decoding with a low-bit KV cache. Leveraging Tensor Cores efficiently for a low-bit KV cache is challenging because new KV entries are generated dynamically at each decoding step. BitDecoding addresses these challenges with a Tensor Cores-Centric BitFusion Scheme that ensures data-layout compatibility and thereby enables high Tensor Core utilization. Additionally, BitDecoding incorporates a warp-efficient parallel decoding kernel and a fine-grained asynchronous pipeline, minimizing dequantization overhead and improving computational efficiency. Experiments show that BitDecoding achieves up to 7.5x speedup on RTX 4090, 4.8x on A100, and 8.9x on H100 over FP16 FlashDecoding-v2. It also outperforms the state-of-the-art low-bit KV cache implementation (QServe) by up to 4.3x. On LLaMA-3.1-8B with a 128K sequence length, BitDecoding reduces single-batch decoding latency by 3x, demonstrating its effectiveness in long-context generation scenarios. The code is available at https://github.com/DD-DuDa/BitDecoding.
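To make the quantization/dequantization overhead concrete, below is a minimal pure-Python sketch of the kind of per-group asymmetric 4-bit KV quantization that such systems accelerate. This is an illustration only, not BitDecoding's implementation: the group size, rounding mode, and two-codes-per-byte packing layout here are assumptions, and the real work in BitDecoding is performing the unpack-and-dequantize step on-GPU, overlapped with Tensor Core matrix multiplies.

```python
# Illustrative per-group asymmetric 4-bit quantization of a KV-cache vector.
# All layout choices (GROUP_SIZE, packing order) are assumptions for the sketch.

GROUP_SIZE = 32  # assumed number of values sharing one (scale, zero-point) pair


def quantize_4bit(values):
    """Quantize floats to 4-bit codes; returns packed bytes and per-group params."""
    packed, params = [], []
    for g in range(0, len(values), GROUP_SIZE):
        group = values[g:g + GROUP_SIZE]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 4 bits -> 16 levels; avoid zero scale
        codes = [round((v - lo) / scale) for v in group]  # each in 0..15
        # pack two 4-bit codes into one byte (low nibble first)
        for i in range(0, len(codes), 2):
            high = codes[i + 1] if i + 1 < len(codes) else 0
            packed.append((high << 4) | codes[i])
        params.append((scale, lo))
    return packed, params


def dequantize_4bit(packed, params, n):
    """Reconstruct approximate floats from packed nibbles and group params."""
    out = []
    bytes_per_group = GROUP_SIZE // 2
    for gi, (scale, lo) in enumerate(params):
        chunk = packed[gi * bytes_per_group:(gi + 1) * bytes_per_group]
        for b in chunk:
            out.append((b & 0xF) * scale + lo)   # low nibble
            out.append((b >> 4) * scale + lo)    # high nibble
    return out[:n]
```

The round trip keeps each value within half a quantization step (scale / 2) of the original while storing 4 bits per element plus the small per-group metadata; doing this unpacking per decoding step on the CPU-style scalar path above is exactly the overhead that a fused, warp-parallel GPU kernel is designed to hide.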