CLO: Efficient LLM Inference System with CPU-Light KVCache Offloading via Algorithm-System Co-Design

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Million-token-scale LLM inference is bottlenecked by KV cache memory overhead and PCIe transfer latency; existing CPU-offloading approaches suffer from three key limitations: high overhead in fine-grained cache management, low PCIe bandwidth utilization due to dense gather operations, and GPU underutilization (“bubbles”) caused by CPU-centric synchronization. This paper proposes a CPU-lightweight KV cache offloading system featuring head-level approximate caching (coarse-grained with controllable precision), a zero-copy transfer engine, and GPU-centric synchronization—enabling algorithm-system co-optimization. The design significantly reduces CPU computational load, eliminates GPU stalls, and improves PCIe bandwidth utilization. Evaluated on mainstream large language models, it achieves 9.3%–66.6% higher decoding throughput over state-of-the-art systems, while matching their accuracy and substantially improving hardware resource efficiency.

📝 Abstract
The growth of million-token LLMs exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations on the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system built via algorithm-system co-design. CLO features: (1) a coarse-grained head-wise approximate on-GPU caching strategy with negligible cache management cost, (2) a seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, and (3) a zero-copy transfer engine that fully exploits PCIe bandwidth, together with a GPU-centric synchronization method that eliminates GPU stalls. Evaluation on two widely used LLMs demonstrates that CLO matches the accuracy of state-of-the-art systems while substantially reducing CPU overhead and fully utilizing PCIe bandwidth, improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open source CLO at https://github.com/CommediaJW/CLO.
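The top-k attention idea the abstract mentions can be illustrated with a minimal sketch (this is not CLO's actual code; shapes and variable names are illustrative): for each decode step, only the k keys most relevant to the current query are gathered from the CPU-resident cache, so the PCIe transfer shrinks from the full sequence to k entries.

```python
import numpy as np

# Hedged sketch of top-k attention KV selection (illustrative, not CLO's code).
rng = np.random.default_rng(0)
seq_len, head_dim, k = 1024, 64, 32

q = rng.standard_normal(head_dim)                # current decode-step query
keys = rng.standard_normal((seq_len, head_dim))  # full key cache (CPU-resident)

# Approximate importance of each cached token for this query.
scores = keys @ q / np.sqrt(head_dim)

# Indices of the k highest-scoring tokens; only these KV entries
# would be gathered and transferred over PCIe.
topk_idx = np.argpartition(scores, -k)[-k:]
fetched_keys = keys[topk_idx]

# Transfer volume drops by a factor of seq_len / k (here 32x).
reduction = seq_len / k
```

As the abstract notes, the catch is that this gather happens on the CPU, which is exactly the bottleneck CLO's zero-copy transfer engine targets.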
Problem

Research questions and friction points this paper is trying to address.

Addresses CPU bottlenecks in KVCache offloading systems for LLM inference
Reduces transfer overhead by improving PCIe bandwidth utilization efficiency
Eliminates GPU runtime bubbles through GPU-centric synchronization methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-grained head-wise on-GPU caching incurs negligible cache management cost
Zero-copy transfer engine maximizes PCIe bandwidth utilization
GPU-centric synchronization eliminates CPU-induced runtime bubbles
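The head-wise caching contribution can be sketched as follows (names and eviction policy are illustrative assumptions, not CLO's API): by tracking one cache entry per attention head instead of per token, management metadata stays O(num_heads) rather than O(num_tokens), which is why the management cost is negligible.

```python
# Hedged sketch of coarse-grained head-wise caching (illustrative, not CLO's API).
class HeadWiseCache:
    """On-GPU cache keyed by attention head: a whole head's KV block is
    admitted or evicted as one unit, keeping metadata tiny."""

    def __init__(self, capacity_heads):
        self.capacity = capacity_heads
        self.cache = {}  # head_id -> KV block for that head

    def get(self, head_id, fetch_fn):
        if head_id in self.cache:      # hit: no PCIe transfer needed
            return self.cache[head_id]
        block = fetch_fn(head_id)      # miss: pull the head's block once
        if len(self.cache) >= self.capacity:
            # Evict the oldest entry (simple FIFO stand-in for a real policy).
            self.cache.pop(next(iter(self.cache)))
        self.cache[head_id] = block
        return block


# Usage: count how many times we actually cross PCIe (simulated by fetch).
fetch_count = {"n": 0}

def fetch(head_id):
    fetch_count["n"] += 1
    return f"kv-block-{head_id}"

cache = HeadWiseCache(capacity_heads=2)
cache.get(0, fetch)  # miss, fetched
cache.get(1, fetch)  # miss, fetched
cache.get(0, fetch)  # hit, served from GPU cache
```

Because entries are coarse (one per head), admission and eviction decisions are rare and cheap, in contrast to token-granular caches whose bookkeeping grows with context length.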
👥 Authors
Jiawei Yi, University of Science and Technology of China
Ping Gong, University of Science and Technology of China
Youhui Bai, University of Science and Technology of China
Jiaqi Ruan, University of Science and Technology of China
Shengnan Wang, Independent Researcher
Pengcheng Wang, Huawei Technologies Co., Ltd
Haibo Wang, Huawei Technologies Co., Ltd
Weiguang Wang, Huawei Technologies Co., Ltd
Xia Zhu, Huawei Technologies Co., Ltd
Feng Wu, National University of Singapore
Cheng Li, University of Science and Technology of China