ScoutAttention: Efficient KV Cache Offloading via Layer-Ahead CPU Pre-computation for LLM Inference

📅 2026-03-28

🤖 AI Summary
This work addresses the limited decode batch sizes in long-context LLM inference, which stem from the GPU memory consumed by key-value (KV) caches. Existing approaches that offload the KV cache to DRAM require either frequent GPU-CPU data transfers or heavy CPU computation, leaving the GPU underutilized while it waits on I/O or CPU work. The authors propose ScoutAttention, a KV cache offloading framework that combines GPU-CPU collaborative block-wise sparse attention, layer-ahead CPU pre-computation (the CPU starts attention computation one layer in advance), and asynchronous periodic recall. The design reduces CPU load while keeping accuracy within 2.4% of the baseline, and achieves a 2.1× speedup over existing offloading methods.
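
To make the memory pressure concrete, the following is a rough back-of-the-envelope sketch of how KV cache size bounds the decode batch size. The per-token formula (keys plus values, per layer, per KV head) is standard; the model shape, the 40 GiB memory budget, and the 32K context length are hypothetical values chosen for illustration and do not come from the paper.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache stored for one token of one sequence."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_decode_batch(free_gpu_bytes, context_len, per_token_bytes):
    """Largest number of sequences whose full KV cache fits in GPU memory."""
    return free_gpu_bytes // (context_len * per_token_bytes)


# Hypothetical model shape (assumption, not from the paper): 32 layers,
# 8 KV heads of dimension 128, FP16 elements (2 bytes each).
per_token = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Hypothetical budget: 40 GiB left for KV cache, 32K-token contexts.
batch = max_decode_batch(
    free_gpu_bytes=40 * 2**30,
    context_len=32 * 1024,
    per_token_bytes=per_token,
)

print(per_token)  # 131072 bytes, i.e. 128 KiB per token
print(batch)      # 10 sequences
```

Under these assumed numbers each sequence needs 4 GiB of KV cache, so only 10 sequences fit on the GPU. Moving most of that cache to DRAM is what allows a larger batch.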

📝 Abstract
Large language models encounter critical GPU memory capacity constraints during long-context inference, where KV cache memory consumption severely limits decode batch sizes. While existing research has explored offloading KV cache to DRAM, these approaches either demand frequent GPU-CPU data transfers or impose extensive CPU computation requirements, resulting in poor GPU utilization as the system waits for I/O operations or CPU processing to complete. We propose ScoutAttention, a novel KV cache offloading framework that accelerates LLM inference through collaborative GPU-CPU attention computation. To prevent CPU computation from bottlenecking the system, ScoutAttention introduces GPU-CPU collaborative block-wise sparse attention that significantly reduces CPU load. Unlike conventional parallel computing approaches, our framework features a novel layer-ahead CPU pre-computation algorithm, enabling the CPU to initiate attention computation one layer in advance, complemented by asynchronous periodic recall mechanisms to maintain minimal CPU compute load. Experimental results demonstrate that ScoutAttention maintains accuracy within 2.4% of baseline while achieving 2.1x speedup compared to existing offloading methods.
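
The abstract does not say how the GPU and CPU partial results are combined. One standard way to split attention across disjoint sets of KV entries is to have each side return its running softmax statistics and merge them with a log-sum-exp rescaling. The sketch below illustrates that general technique in pure Python; it is an assumption about how such collaboration could work, not the authors' implementation, and the names `partial_attention`, `merge`, and `finalize` are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention over one subset of KV pairs.

    Returns (max_score, sum_of_exp, unnormalized_output) so that results
    from disjoint subsets can be merged later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    out = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return m, sum(weights), out


def merge(a, b):
    """Combine two partial results computed over disjoint KV subsets."""
    m_a, l_a, o_a = a
    m_b, l_b, o_b = b
    m = max(m_a, m_b)
    # Rescale both sides to the shared maximum before adding.
    c_a, c_b = math.exp(m_a - m), math.exp(m_b - m)
    l = c_a * l_a + c_b * l_b
    o = [c_a * x + c_b * y for x, y in zip(o_a, o_b)]
    return m, l, o


def finalize(state):
    """Normalize the accumulated output by the softmax denominator."""
    _, l, o = state
    return [x / l for x in o]


# Toy data: the first two KV pairs stand in for blocks resident on the GPU,
# the last two for blocks offloaded to the CPU (hypothetical split).
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.5, 0.5], [-0.7, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

gpu_part = partial_attention(q, keys[:2], values[:2])
cpu_part = partial_attention(q, keys[2:], values[2:])
merged = finalize(merge(gpu_part, cpu_part))
full = finalize(partial_attention(q, keys, values))
```

Here `merged` matches `full` up to floating-point error, so the two sides can work on their own blocks independently and exchange only a small per-query state rather than the KV entries themselves.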
Problem

Research questions and friction points this paper is trying to address.

KV cache
GPU memory bottleneck
LLM inference
CPU offloading
long-context
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV cache offloading
layer-ahead pre-computation
GPU-CPU collaboration
sparse attention
LLM inference acceleration
Qiuyang Zhang
Huazhong University of Science and Technology
Kai Zhou
Huazhong University of Science and Technology
Ding Tang
Huazhong University of Science and Technology
Kai Lu
Postdoc, Huazhong University of Science and Technology
Distributed storage systems, key-value storage, AI storage
Cheng Li
Huawei Technologies
Zhenyu Yang
Huawei Technologies
Peng Xu
Research Center for High Efficiency Computing Infrastructure, Zhejiang Lab
Jiguang Wan
Huazhong University of Science and Technology