DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

πŸ“… 2026-02-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the inherent tension between cache affinity and load balancing in distributed large language model (LLM) serving, a challenge that existing approaches resolve only partially. The authors propose DualMap, a dual-mapping scheduling strategy that, for the first time, achieves both efficient KV-cache reuse and effective load balancing within a unified framework. Built on the "power of two choices" paradigm, DualMap employs two independent hash functions to generate candidate instances and integrates SLO-aware routing, hotspot-aware rebalancing, and a lightweight dual-hash-ring mechanism for elastic scaling. Evaluated under realistic workloads, the system achieves up to a 2.25× increase in effective request capacity under identical TTFT SLO constraints, substantially outperforming state-of-the-art methods.

πŸ“ Abstract
In LLM serving, reusing the KV cache of prompts across requests is critical for reducing time-to-first-token (TTFT) and serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling that distributes requests evenly across compute instances. Existing schedulers fail to reconcile this trade-off as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To address this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that achieves both cache affinity and load balancing. Its key idea is to map each request to two candidate instances via two independent hash functions based on the request prompt, then intelligently select the better candidate based on current system states. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via "the power of two choices". To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the SLO, enhancing load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots and rebalancing the system; 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast and low-overhead instance scaling without costly global remapping. Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25× under the same TTFT SLO constraints compared with SOTA work.
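The dual-mapping idea from the abstract (two independent hashes of the prompt, then an SLO-aware choice between the two candidates) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the salted SHA-256 hashing, and the scalar load threshold standing in for a TTFT/SLO predictor are all our assumptions.

```python
import hashlib

def stable_hash(key: str, salt: str) -> int:
    """Deterministic 64-bit hash of a prompt prefix, salted per hash function."""
    digest = hashlib.sha256((salt + key).encode()).digest()
    return int.from_bytes(digest[:8], "big")

class DualMapRouter:
    """Illustrative two-choices router (a sketch, not DualMap itself):
    two independent hashes of the prompt prefix yield two candidate
    instances; routing prefers a cache-affine candidate and falls back
    to the less-loaded candidate when affinity would breach the SLO."""

    def __init__(self, num_instances: int, slo_load: float):
        self.n = num_instances
        self.slo_load = slo_load           # load beyond which TTFT is assumed to violate the SLO
        self.load = [0.0] * num_instances  # e.g., queued prefill tokens per instance

    def route(self, prompt_prefix: str, has_prefix_cache) -> int:
        a = stable_hash(prompt_prefix, "h1") % self.n
        b = stable_hash(prompt_prefix, "h2") % self.n
        # SLO-aware selection: reuse the KV cache if its holder is not overloaded...
        for cand in (a, b):
            if has_prefix_cache(cand) and self.load[cand] <= self.slo_load:
                return cand
        # ...otherwise balance load via "the power of two choices".
        return min((a, b), key=lambda i: self.load[i])
```

Because both candidates are pure functions of the prompt prefix, repeated requests with the same prefix land on the same pair of instances (preserving affinity), while distinct prefixes spread their pairs uniformly across the cluster.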
Problem

Research questions and friction points this paper is trying to address.

cache affinity
load balancing
distributed LLM serving
KV cache reuse
request scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

DualMap
cache affinity
load balancing
power of two choices
KV cache reuse
πŸ”Ž Similar Papers
No similar papers found.
Ying Yuan
Carnegie Mellon University
Robot learning
Pengfei Zuo
Huawei
AI Infrastructure, Cloud Infrastructure, Machine Learning Systems, Memory Systems, Storage Systems
Bo Wang
Huawei
Zhangyu Chen
Huawei
Zhipeng Tan
Huazhong University of Science and Technology
Zhou Yu
Huawei