🤖 AI Summary
To address the memory bottleneck in large language model (LLM) inference caused by HBM bandwidth limitations, this work proposes an L2-cache-aware asynchronous KV cache prefetching mechanism: it prewarms KV data into the GPU's L2 cache using memory bandwidth that sits idle during active compute windows, enabling deep overlap between computation and memory access. The method introduces the first asynchronous, L2-aware KV scheduling strategy, requires no modifications to the model architecture or training pipeline, and is orthogonal to and composable with existing optimizations. Implemented via CUDA kernel-level customization, it achieves a 2.15× improvement in attention kernel efficiency and up to a 1.97× increase in end-to-end inference throughput on NVIDIA H20 GPUs. These gains surpass those of state-of-the-art approaches such as FlashAttention-3, demonstrating substantial practical impact for memory-bound LLM inference.
📝 Abstract
Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method that breaks through the memory bandwidth bottleneck in LLM inference via computation–memory-access overlap. By strategically scheduling idle memory bandwidth during active computation windows, our method proactively prefetches the required KV Cache into the GPU L2 cache, enabling high-speed L2 cache hits for subsequent accesses and effectively hiding HBM access latency within computational cycles. Extensive experiments on NVIDIA H20 GPUs demonstrate that the proposed method achieves a 2.15× improvement in attention kernel efficiency and up to a 1.97× end-to-end throughput enhancement, surpassing the state-of-the-art baseline FlashAttention-3. Notably, our solution maintains orthogonality to existing optimization techniques and can be integrated with current inference frameworks, providing a scalable latency-hiding solution for next-generation LLM inference engines.
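The core idea — issuing L2 prefetches for the next KV block while the current block is being consumed — can be sketched at the kernel level. The following is a minimal, hypothetical CUDA illustration (not the paper's actual kernel; the tiling constants, the single-query layout, and the `attention_with_l2_prefetch` name are all assumptions for exposition). It uses the standard PTX `prefetch.global.L2` instruction, which pulls a cache line toward L2 without blocking the issuing warp, so the HBM-to-L2 transfer overlaps with the dot-product computation on the previously prefetched block:

```cuda
// Sketch: overlap HBM->L2 KV prefetch with attention compute.
// Layout and scheduling here are illustrative assumptions, not the
// paper's implementation.

__device__ __forceinline__ void prefetch_l2(const void* ptr) {
    // prefetch.global.L2 brings the line into L2 asynchronously;
    // the warp does not stall waiting for the data.
    asm volatile("prefetch.global.L2 [%0];" :: "l"(ptr));
}

__global__ void attention_with_l2_prefetch(const float* __restrict__ q,      // [HEAD_DIM]
                                           const float* __restrict__ k,      // [num_blocks][BLOCK][HEAD_DIM]
                                           float* __restrict__ scores,       // [num_blocks][BLOCK]
                                           int num_blocks) {
    constexpr int BLOCK = 64, HEAD_DIM = 128;
    constexpr int FLOATS_PER_LINE = 128 / sizeof(float);  // 128-byte cache line
    const int lane = threadIdx.x;

    for (int b = 0; b < num_blocks; ++b) {
        // Step 1: issue L2 prefetches for block b+1 BEFORE computing on
        // block b, so the transfer hides behind the compute below.
        if (b + 1 < num_blocks) {
            const float* next = k + (size_t)(b + 1) * BLOCK * HEAD_DIM;
            for (int i = lane * FLOATS_PER_LINE; i < BLOCK * HEAD_DIM;
                 i += blockDim.x * FLOATS_PER_LINE)
                prefetch_l2(next + i);
        }

        // Step 2: compute q·k scores on block b. Its K rows were
        // prefetched on the previous iteration, so these loads should
        // hit in L2 instead of paying full HBM latency.
        const float* cur = k + (size_t)b * BLOCK * HEAD_DIM;
        const int row = lane % BLOCK;  // one K row per lane (illustrative)
        float acc = 0.f;
        for (int d = 0; d < HEAD_DIM; ++d)
            acc += q[d] * cur[row * HEAD_DIM + d];
        scores[b * BLOCK + row] = acc;
    }
}
```

In a production kernel the prefetch schedule would be tuned to the L2 capacity and the attention tile sizes so that prewarmed KV data is not evicted before it is consumed; the single-loop structure above only shows the overlap pattern itself.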