SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-region LLM inference suffers from low GPU utilization and high operational costs, primarily due to geographically localized traffic patterns and diurnal demand fluctuations, which cause severe underutilization of long-term reserved resources (e.g., reserved cloud instances or on-premises clusters). This paper proposes the first cross-regional collaborative scheduling framework that jointly optimizes KV-cache locality and global load balancing—departing from the conventional per-region peak-reservation paradigm toward unified, globally demand-driven instance reservation. Its core mechanisms include cache-aware inter-regional traffic routing, selective push-based load balancing, and request-queue-state-driven dynamic routing. Evaluated on real-world workloads, the system achieves 1.12–2.06× higher throughput, reduces tail latency by 1.74–6.30×, and lowers total service cost by 25%.

📝 Abstract
Serving Large Language Models (LLMs) efficiently in multi-region setups remains a challenge. Due to cost and GPU availability concerns, providers typically deploy LLMs in multiple regions using instances with long-term commitments, like reserved instances or on-premise clusters, which are often underutilized due to their region-local traffic handling and diurnal traffic variance. In this paper, we introduce SkyLB, a locality-aware multi-region load balancer for LLM inference that aggregates regional diurnal patterns through cross-region traffic handling. By doing so, SkyLB enables providers to reserve instances based on expected global demand, rather than peak demand in each individual region. Meanwhile, SkyLB preserves KV-cache locality and balanced load, ensuring cost efficiency without sacrificing performance. SkyLB achieves this with a cache-aware cross-region traffic handler and a selective-pushing load-balancing mechanism that checks pending requests. Our evaluation on real-world workloads shows that it achieves 1.12-2.06x higher throughput and 1.74-6.30x lower latency compared to existing load balancers, while reducing total serving cost by 25%.
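The reservation argument in the abstract can be made concrete with a toy calculation: when regional diurnal peaks are offset across time zones, provisioning for the combined global peak needs fewer instances than provisioning each region for its own peak. The demand curves below are invented for illustration and are not from the paper.

```python
# Hypothetical illustration of global vs. per-region peak reservation.
# Hourly demand (requests/s) for two regions whose daytime peaks are
# offset by 12 hours; all numbers are invented for this sketch.
us_demand = [20 if 8 <= h < 20 else 5 for h in range(24)]
asia_demand = [20 if 8 <= (h + 12) % 24 < 20 else 5 for h in range(24)]

# Per-region reservation: each region provisions for its own peak.
per_region = max(us_demand) + max(asia_demand)

# Global reservation: provision for the peak of the combined demand,
# which is flatter because the regional peaks do not overlap.
global_peak = max(u + a for u, a in zip(us_demand, asia_demand))

print(per_region)   # 40
print(global_peak)  # 25
```

With these made-up curves, cross-region aggregation cuts the reserved capacity from 40 to 25 units, which is the kind of saving behind the paper's 25% cost reduction claim.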
Problem

Research questions and friction points this paper is trying to address.

Optimizing multi-region LLM inference with underutilized reserved instances
Balancing global demand and regional diurnal traffic patterns efficiently
Maintaining KV-Cache locality and performance while reducing serving costs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Locality-aware multi-region load balancer
Cache-aware cross-region traffic handler
Selective pushing load balancing mechanism
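The mechanisms above can be sketched as a routing policy: send a request to the replica most likely to hold its KV cache, and push it elsewhere only when that replica's pending queue is backed up. This is a minimal illustration assuming consistent hashing on a session key as the locality signal; the names, threshold, and hashing scheme are hypothetical, not the paper's actual implementation.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    pending: int = 0  # requests queued but not yet running

def cache_affinity(session_key: str, replicas: list[Replica]) -> Replica:
    # A stable session-to-replica mapping approximates KV-cache locality:
    # repeated turns of one conversation land on the same replica.
    digest = int(hashlib.sha256(session_key.encode()).hexdigest(), 16)
    return replicas[digest % len(replicas)]

def route(session_key: str, replicas: list[Replica],
          max_pending: int = 4) -> Replica:
    preferred = cache_affinity(session_key, replicas)
    if preferred.pending <= max_pending:
        return preferred  # keep cache locality
    # Selective push: only when the preferred replica is backed up,
    # fall back to the least-loaded replica anywhere.
    return min(replicas, key=lambda r: r.pending)
```

Repeated calls with the same session key hit the same replica, preserving the cache, until its pending queue crosses the threshold and the request is pushed to the globally least-loaded replica.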