BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses three key challenges in disaggregated LLM serving: (1) SLO violations due to static resource allocation, (2) load imbalance between the compute-intensive prefill and memory-intensive decode phases, and (3) node hotspots caused by prefix-cache-aware routing. To resolve these issues, the authors propose BanaServe, a dynamic co-scheduling framework featuring layer-level weight migration and attention-level KV cache migration, enabling a globally shared KV cache pool that decouples routing from cache-location constraints. Integrated with load-aware scheduling, dynamic module migration, and inter-stage overlapped transfers, the approach achieves fine-grained, cross-phase load balancing. Experimental results demonstrate that, compared to vLLM, BanaServe improves throughput by 1.2–3.9× and reduces total processing time by 3.9%–78.4%; against DistServe, it increases throughput by 1.1–2.8× and decreases latency by 1.4%–70.1%.

📝 Abstract
Large language models (LLMs) are increasingly deployed in AI infrastructure, driving the need for high-throughput, resource-efficient serving systems. Disaggregated LLM serving, which separates prompt prefill from auto-regressive decode, has emerged as a promising architecture by isolating their heterogeneous compute and memory demands. However, current disaggregated systems face three key limitations: (i) static resource allocation cannot adapt to highly dynamic workloads, causing over-provisioning that wastes resources or under-provisioning that violates service-level objectives (SLOs); (ii) inherent load imbalance between prefill and decode stages, where prefill is compute-bound and decode is memory-bound, causes under-utilization in one tier while the other becomes a bottleneck; and (iii) prefix-cache-aware routing skews load distribution, as prefill nodes with high cache hit rates attract disproportionately more requests, further degrading balance and efficiency. To address these issues, we present BanaServe, a dynamic orchestration framework that continuously rebalances computational and memory resources across prefill and decode instances while eliminating cache-induced hotspots. BanaServe introduces layer-level weight migration, attention-level Key-Value Cache (KV Cache) migration, and Global KV Cache Store sharing with layer-wise overlapped transmission, enabling both coarse-grained (layer-level) and fine-grained (attention-level) load redistribution with minimal latency overhead. These mechanisms allow routers to perform purely load-aware scheduling, unconstrained by cache placement. Compared to vLLM, BanaServe achieves 1.2×–3.9× higher throughput with 3.9%–78.4% lower total processing time, and outperforms DistServe by 1.1×–2.8× in throughput with 1.4%–70.1% latency reduction.
Problem

Research questions and friction points this paper is trying to address.

Dynamic resource allocation for fluctuating LLM workloads
Load imbalance mitigation between compute and memory stages
Cache-induced hotspot elimination through KV cache migration
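The cache-induced hotspot problem above can be illustrated with a toy simulation (not from the paper; all names and numbers here are hypothetical): when routing follows prefix affinity, requests sharing a hot prefix pile onto one node, whereas a purely load-aware router, freed from cache placement by a shared KV cache pool, spreads load almost evenly.

```python
import random

random.seed(0)

NUM_NODES = 4
# Simulated workload: ~70% of requests share one hot prompt prefix.
requests = ["hot" if random.random() < 0.7 else f"cold{random.randrange(100)}"
            for _ in range(1000)]

def prefix_affinity_route(reqs):
    """Route each request to the node that owns its prefix cache (hash-based).
    All requests with the same prefix land on the same node."""
    load = [0] * NUM_NODES
    for prefix in reqs:
        load[hash(prefix) % NUM_NODES] += 1
    return load

def load_aware_route(reqs):
    """With a globally shared KV cache pool, any node can serve any request,
    so the router simply picks the least-loaded node."""
    load = [0] * NUM_NODES
    for _ in reqs:
        load[load.index(min(load))] += 1
    return load

print("prefix-affinity per-node load:", prefix_affinity_route(requests))
print("load-aware per-node load:     ", load_aware_route(requests))
```

The hot prefix forces one node to absorb the majority of traffic under prefix-affinity routing, while load-aware routing keeps per-node counts within one request of each other.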
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified KV cache migration for load balancing
Dynamic module migration across compute instances
Global KV cache store with overlapped transmission
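The global KV cache store idea can be sketched minimally as a content-addressed block map that any prefill or decode instance can read, so routing no longer depends on where a prefix was originally cached. This is an illustrative sketch under assumed semantics, not BanaServe's implementation; the class and method names are hypothetical.

```python
import hashlib

class GlobalKVCacheStore:
    """Illustrative global KV cache pool: KV blocks are keyed by a hash of
    their token IDs, so a block written by one prefill instance is visible
    to every decode instance."""

    def __init__(self):
        self._blocks = {}

    @staticmethod
    def block_key(token_ids):
        # Content-addressed key: identical prefixes map to the same block.
        return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

    def put(self, token_ids, kv_block):
        self._blocks[self.block_key(token_ids)] = kv_block

    def get(self, token_ids):
        # Returns None on a cache miss.
        return self._blocks.get(self.block_key(token_ids))

store = GlobalKVCacheStore()
store.put([1, 2, 3], "kv-for-prefix-123")   # written by a prefill instance
print(store.get([1, 2, 3]))                 # → kv-for-prefix-123, readable anywhere
```

In the paper's design, transfers into and out of this shared store are overlapped layer by layer with computation to hide migration latency; the sketch above only captures the location-independent lookup.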
Yiyuan He
Southern University of Science and Technology, Shenzhen, China

Minxian Xu
Associate Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing · Microservices · LLM Inference

Jingfeng Wu
University of California, Berkeley
Deep Learning Theory · Machine Learning · Optimization · Statistical Learning Theory

Jianmin Hu
Southern University of Science and Technology, Shenzhen, China

Chong Ma
Southwest Jiaotong University
Deep Learning · Human-Computer Interaction · Medical Image Analysis

Min Shen
AIOS Team, Alibaba Group Inc., Hangzhou, China

Le Chen
AIOS Team, Alibaba Group Inc., Hangzhou, China

Chengzhong Xu
State Key Lab of IOTSC, Faculty of Science and Technology, University of Macau, Macau SAR, China

Lin Qu
AIOS Team, Alibaba Group Inc., Hangzhou, China

Kejiang Ye
Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing · AI Systems · Industrial Internet