Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 81
Influential: 8
🤖 AI Summary
To address the throughput–latency SLO trade-off caused by long-context requests under high load in LLM serving, this paper proposes a KVCache-centric disaggregated architecture: it separates prefill and decoding computation onto distinct resource pools and builds a distributed, multi-tier KVCache layer from the GPU cluster's underutilized CPU, DRAM, and SSD resources. An SLO-aware, KVCache-centric scheduler enables dynamic resource allocation and latency-prediction–guided early rejection of requests that would miss their deadlines. Key contributions include the disaggregated KVCache store, coordinated scheduling across heterogeneous resources, and SLO-driven predictive admission control. Experiments show up to a 525% throughput improvement in certain simulated scenarios; under real-world Kimi production traffic, Mooncake handles 75% more requests while meeting end-to-end latency SLOs.

📝 Abstract
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
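The abstract's core idea — separate prefill and decode stages connected by a transferred KVCache — can be sketched as a minimal request flow. All names below (`prefill`, `transfer`, `decode`, `KVCache`) are illustrative stand-ins, not Mooncake's actual API; real implementations would operate on GPU tensors and move the cache across nodes.

```python
# Hypothetical sketch of disaggregated LLM serving: a prefill node processes
# the prompt once and emits its KVCache; the cache is transferred to a
# separate decode node, which generates tokens one step at a time.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Key/value state produced during prefill, reused at every decode step."""
    tokens: list
    layers: list = field(default_factory=list)  # stand-in for per-layer tensors


def prefill(prompt_tokens: list) -> KVCache:
    """Prefill stage: process the whole prompt in one pass, emit its KVCache."""
    return KVCache(tokens=list(prompt_tokens), layers=[("k", "v")] * 2)


def transfer(cache: KVCache) -> KVCache:
    """Move the KVCache from the prefill pool to the decode pool
    (over the DRAM/SSD tiers in the paper's disaggregated cache)."""
    return cache  # stand-in for a cross-node copy


def decode(cache: KVCache, max_new_tokens: int) -> list:
    """Decode stage: generate one token per step, growing the KVCache."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"           # stand-in for model sampling
        cache.tokens.append(tok)  # cache grows by one entry per step
        out.append(tok)
    return out


cache = prefill(["Hello", "world"])
completion = decode(transfer(cache), max_new_tokens=3)
print(completion)  # ['tok0', 'tok1', 'tok2']
```

The split matters because prefill is compute-bound and decode is memory-bandwidth-bound, so giving each stage its own cluster lets both be provisioned independently.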
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM serving throughput under latency SLO constraints
Managing KVCache resources in disaggregated GPU cluster architecture
Handling request overload through prediction-based rejection policies
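The second friction point — managing KVCache across the cluster's storage tiers — amounts to a prefix-keyed lookup over fast and slow tiers. The sketch below is an assumption-laden illustration (class and method names are invented, not Mooncake's): check DRAM first, fall back to SSD, and promote hits back into the fast tier.

```python
# Illustrative two-tier KVCache store: prompts with identical prefixes can
# reuse cached KV blocks, so the key is a hash of the token prefix.
import hashlib


class TieredKVCacheStore:
    def __init__(self):
        self.dram = {}  # fast, small tier
        self.ssd = {}   # slow, large tier

    @staticmethod
    def prefix_key(tokens) -> str:
        # Reuse is prefix-based: identical prompt prefixes share cache blocks.
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv_blocks) -> None:
        self.ssd[self.prefix_key(tokens)] = kv_blocks

    def get(self, tokens):
        key = self.prefix_key(tokens)
        if key in self.dram:
            return self.dram[key]           # hot hit
        if key in self.ssd:
            self.dram[key] = self.ssd[key]  # promote on access
            return self.dram[key]
        return None                         # miss: prefill must recompute


store = TieredKVCacheStore()
store.put(["You", "are", "a", "helpful"], kv_blocks="<blocks>")
print(store.get(["You", "are", "a", "helpful"]))  # '<blocks>' (now in DRAM)
```

A production store would also need eviction from DRAM and chunked transfers, but the lookup-then-promote shape is the essential cache-management pattern.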
Innovation

Methods, ideas, or system contributions that make the work stand out.

KVCache-centric disaggregated architecture separating prefill and decoding
Leverages underutilized CPU, DRAM, SSD resources for disaggregated cache
Prediction-based early rejection policy for overloaded scenarios
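The early-rejection idea in the last bullet can be sketched as an admission check: predict the time-to-first-token (TTFT) a new request would see given the current prefill backlog, and reject it up front if the prediction already breaks the SLO. The linear load model and all names here are assumptions for illustration, not Mooncake's actual predictor.

```python
# Hedged sketch of prediction-based early rejection: admit a request only if
# its predicted TTFT (queue drain time + its own prefill time) fits the SLO.

def predict_ttft_ms(queued_prefill_tokens: int, new_prompt_tokens: int,
                    tokens_per_ms: float = 50.0) -> float:
    """Predict TTFT as backlog drain time plus this request's prefill time."""
    return (queued_prefill_tokens + new_prompt_tokens) / tokens_per_ms


def admit(queued_prefill_tokens: int, new_prompt_tokens: int,
          ttft_slo_ms: float = 5000.0) -> bool:
    """Reject early instead of accepting work destined to miss its SLO."""
    return predict_ttft_ms(queued_prefill_tokens, new_prompt_tokens) <= ttft_slo_ms


print(admit(queued_prefill_tokens=100_000, new_prompt_tokens=8_000))  # True
print(admit(queued_prefill_tokens=400_000, new_prompt_tokens=8_000))  # False
```

Rejecting at admission time, rather than after queuing, frees the prefill cluster from work that could never meet its deadline, which is why the paper reports higher goodput under overload.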
👥 Authors
Ruoyu Qin
Tsinghua University
Distributed System · Machine Learning System
Zheming Li
Sandia National Laboratories
IC Engine · Laser Diagnostic
Weiran He
Unknown affiliation
Mingxing Zhang
Tsinghua University
Yongwei Wu
Moonshot AI
Weimin Zheng
Moonshot AI
Xinran Xu
Moonshot AI