Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 81
Influential: 8
🤖 AI Summary
To address the throughput–latency SLO trade-off caused by long-context requests under high load in LLM serving, this paper proposes a KVCache-centric disaggregated architecture: it separates prefill and decoding computation onto distinct resource pools and builds a distributed, multi-tier KVCache layer from the GPU cluster's underutilized CPU, DRAM, and SSD resources. An SLO-aware, KVCache-centric scheduler enables dynamic resource allocation and latency-prediction–guided early rejection of requests that would miss their deadlines. Key contributions include the disaggregated KVCache store, coordinated scheduling across heterogeneous resources, and SLO-driven predictive admission control. Experiments show up to a 525% throughput improvement in certain simulated scenarios; under real-world Kimi production traffic, Mooncake handles 75% more requests while meeting end-to-end latency SLOs.

📝 Abstract
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput while meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges due to highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
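The abstract's core idea — separate prefill and decode stages connected by a transferred KVCache — can be sketched as a minimal request flow. All names below (`prefill`, `transfer`, `decode`, `KVCache`) are illustrative stand-ins, not Mooncake's actual API; real implementations would operate on GPU tensors and move the cache across nodes.

```python
# Hypothetical sketch of disaggregated LLM serving: a prefill node processes
# the prompt once and emits its KVCache; the cache is transferred to a
# separate decode node, which generates tokens one step at a time.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Key/value state produced during prefill, reused at every decode step."""
    tokens: list
    layers: list = field(default_factory=list)  # stand-in for per-layer tensors


def prefill(prompt_tokens: list) -> KVCache:
    """Prefill stage: process the whole prompt in one pass, emit its KVCache."""
    return KVCache(tokens=list(prompt_tokens), layers=[("k", "v")] * 2)


def transfer(cache: KVCache) -> KVCache:
    """Move the KVCache from the prefill pool to the decode pool
    (over the DRAM/SSD tiers in the paper's disaggregated cache)."""
    return cache  # stand-in for a cross-node copy


def decode(cache: KVCache, max_new_tokens: int) -> list:
    """Decode stage: generate one token per step, growing the KVCache."""
    out = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"           # stand-in for model sampling
        cache.tokens.append(tok)  # cache grows by one entry per step
        out.append(tok)
    return out


cache = prefill(["Hello", "world"])
completion = decode(transfer(cache), max_new_tokens=3)
print(completion)  # ['tok0', 'tok1', 'tok2']
```

The split matters because prefill is compute-bound and decode is memory-bandwidth-bound, so giving each stage its own cluster lets both be provisioned independently.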
Problem

Research questions and friction points this paper is trying to address.

Optimizing LLM serving throughput under latency SLO constraints
Managing KVCache resources in disaggregated GPU cluster architecture
Handling request overload through prediction-based rejection policies
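The second friction point — managing KVCache across the cluster's storage tiers — amounts to a prefix-keyed lookup over fast and slow tiers. The sketch below is an assumption-laden illustration (class and method names are invented, not Mooncake's): check DRAM first, fall back to SSD, and promote hits back into the fast tier.

```python
# Illustrative two-tier KVCache store: prompts with identical prefixes can
# reuse cached KV blocks, so the key is a hash of the token prefix.
import hashlib


class TieredKVCacheStore:
    def __init__(self):
        self.dram = {}  # fast, small tier
        self.ssd = {}   # slow, large tier

    @staticmethod
    def prefix_key(tokens) -> str:
        # Reuse is prefix-based: identical prompt prefixes share cache blocks.
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv_blocks) -> None:
        self.ssd[self.prefix_key(tokens)] = kv_blocks

    def get(self, tokens):
        key = self.prefix_key(tokens)
        if key in self.dram:
            return self.dram[key]           # hot hit
        if key in self.ssd:
            self.dram[key] = self.ssd[key]  # promote on access
            return self.dram[key]
        return None                         # miss: prefill must recompute


store = TieredKVCacheStore()
store.put(["You", "are", "a", "helpful"], kv_blocks="<blocks>")
print(store.get(["You", "are", "a", "helpful"]))  # '<blocks>' (now in DRAM)
```

A production store would also need eviction from DRAM and chunked transfers, but the lookup-then-promote shape is the essential cache-management pattern.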
Innovation

Methods, ideas, or system contributions that make the work stand out.

KVCache-centric disaggregated architecture separating prefill and decoding
Leverages underutilized CPU, DRAM, SSD resources for disaggregated cache
Prediction-based early rejection policy for overloaded scenarios
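The early-rejection idea in the last bullet can be sketched as an admission check: predict the time-to-first-token (TTFT) a new request would see given the current prefill backlog, and reject it up front if the prediction already breaks the SLO. The linear load model and all names here are assumptions for illustration, not Mooncake's actual predictor.

```python
# Hedged sketch of prediction-based early rejection: admit a request only if
# its predicted TTFT (queue drain time + its own prefill time) fits the SLO.

def predict_ttft_ms(queued_prefill_tokens: int, new_prompt_tokens: int,
                    tokens_per_ms: float = 50.0) -> float:
    """Predict TTFT as backlog drain time plus this request's prefill time."""
    return (queued_prefill_tokens + new_prompt_tokens) / tokens_per_ms


def admit(queued_prefill_tokens: int, new_prompt_tokens: int,
          ttft_slo_ms: float = 5000.0) -> bool:
    """Reject early instead of accepting work destined to miss its SLO."""
    return predict_ttft_ms(queued_prefill_tokens, new_prompt_tokens) <= ttft_slo_ms


print(admit(queued_prefill_tokens=100_000, new_prompt_tokens=8_000))  # True
print(admit(queued_prefill_tokens=400_000, new_prompt_tokens=8_000))  # False
```

Rejecting at admission time, rather than after queuing, frees the prefill cluster from work that could never meet its deadline, which is why the paper reports higher goodput under overload.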
👥 Authors
Ruoyu Qin
Tsinghua University
Distributed System · Machine Learning System
Zheming Li
Sandia National Laboratories
IC Engine · Laser Diagnostic
Weiran He
Unknown affiliation
Mingxing Zhang
Tsinghua University
Yongwei Wu
Moonshot AI
Weimin Zheng
Moonshot AI
Xinran Xu
Moonshot AI