🤖 AI Summary
To address the throughput–latency SLO trade-off that long-context requests create under high load in LLM serving, this paper proposes a KVCache-centric disaggregated architecture: it separates prefill and decoding computation onto distinct resources and builds a distributed, heterogeneous KVCache layer from otherwise idle CPU, DRAM, and SSD capacity. An SLO-aware, KVCache-centric scheduler enables dynamic resource allocation and latency-prediction-guided early request rejection. The key innovations are the first realization of disaggregated KVCache storage, coordinated scheduling across heterogeneous resources, and SLO-driven, real-time predictive scheduling. Experiments show up to a 525% throughput improvement in simulated scenarios; under real Kimi production traffic, request capacity increases by 75% while end-to-end latency SLOs are strictly met.
📝 Abstract
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache store. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs). Unlike traditional studies that assume all requests will be processed, Mooncake faces challenges from highly overloaded scenarios. To mitigate these, we developed a prediction-based early rejection policy. Experiments show that Mooncake excels in long-context scenarios. Compared to the baseline method, Mooncake can achieve up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake's innovative architecture enables Kimi to handle 75% more requests.
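To make the early-rejection idea concrete, here is a minimal sketch of what a prediction-based admission check could look like. All names, the linear prefill cost model, and the constants are illustrative assumptions, not Mooncake's actual implementation: the point is only that the scheduler predicts a request's latency from the current backlog and rejects it up front if the prediction already violates its SLO.

```python
# Hypothetical sketch of prediction-based early rejection; the cost model,
# constants, and names are assumptions for illustration, not the paper's code.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # length of the prefill-stage input
    ttft_slo_s: float    # SLO on time-to-first-token, in seconds

# Assumed linear cost model: prefill time grows with total queued tokens.
PREFILL_SECONDS_PER_TOKEN = 0.25e-3

def predict_ttft(queue: list[Request], req: Request) -> float:
    """Predict TTFT for `req` if appended to the current prefill queue."""
    backlog = sum(r.prompt_tokens for r in queue) + req.prompt_tokens
    return backlog * PREFILL_SECONDS_PER_TOKEN

def admit(queue: list[Request], req: Request) -> bool:
    """Reject the request now if its predicted TTFT would already miss
    the SLO, rather than wasting compute and failing it later."""
    return predict_ttft(queue, req) <= req.ttft_slo_s

queue = [Request(prompt_tokens=8000, ttft_slo_s=2.0)]
print(admit(queue, Request(prompt_tokens=1000, ttft_slo_s=3.0)))     # True
print(admit(queue, Request(prompt_tokens=120_000, ttft_slo_s=5.0)))  # False
```

In the real system the predictor must also account for decode-stage SLOs (time between tokens) and for KVCache prefix hits that shorten prefill, which is why the scheduling is described as KVCache-centric.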