OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language model (LLM) serving systems struggle with inference requests that are heterogeneous in their compute and memory demands (the spatial dimension) and with workload compositions that shift over time (the temporal dimension), leading to suboptimal resource utilization and performance bottlenecks. This work proposes an adaptive deployment framework tailored to spatiotemporally heterogeneous workloads: it jointly leverages workload-aware scheduling, dynamic model migration, and predictive modeling of load fluctuations to optimize deployment strategies in real time. By moving beyond conventional static, homogeneous deployment paradigms, the proposed approach achieves up to 2× higher throughput than state-of-the-art systems under real-world workloads, with an average improvement of 1.5×.

📝 Abstract
Serving Large Language Models (LLMs) can benefit immensely from parallelizing both the model and input requests across multiple devices, but incoming workloads exhibit substantial spatial and temporal heterogeneity. Spatially, workloads comprise heterogeneous requests with varying compute and memory demands. Temporally, workload composition varies over time. Nevertheless, existing systems typically assume spatially uniform and temporally stable workloads, employing a homogeneous, static model deployment. This mismatch between the assumption and real-world spatial-temporal heterogeneity results in suboptimal performance. We present OServe, an LLM serving system with heterogeneous and flexible model deployment that addresses both spatial and temporal heterogeneity. First, OServe introduces a novel workload-aware scheduling algorithm that optimizes heterogeneous model deployments according to real-time workload characteristics. Second, OServe proposes an efficient workload-adaptive switching method that migrates model deployments in response to predicted workload changes. Experiments on real-world traces show that OServe improves performance by up to 2× (average: 1.5×) compared to state-of-the-art serving systems.
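To make the workload-aware idea concrete, the sketch below illustrates the general principle of choosing a model deployment based on the current request mix. All deployment names, token rates, and the additive cost model are illustrative assumptions for this sketch, not details taken from the OServe paper (whose actual scheduling algorithm is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int   # prefill work (compute-bound)
    output_tokens: int   # decode work (memory-bound)

# Hypothetical candidate deployments: (name, prefill_rate, decode_rate)
# in tokens/s. Rates are made-up placeholders for illustration only.
CONFIGS = [
    ("tp4-prefill-heavy", 8000.0, 1000.0),
    ("tp2x2-balanced",    5000.0, 2000.0),
    ("replica4-decode",   2500.0, 3500.0),
]

def estimated_drain_time(batch, prefill_rate, decode_rate):
    """Time to serve the batch under a simple additive cost model."""
    prefill = sum(r.prompt_tokens for r in batch) / prefill_rate
    decode = sum(r.output_tokens for r in batch) / decode_rate
    return prefill + decode

def choose_deployment(batch):
    """Pick the candidate deployment minimizing estimated serve time."""
    best = min(CONFIGS, key=lambda c: estimated_drain_time(batch, c[1], c[2]))
    return best[0]

# A prefill-dominated batch (long prompts, short answers) favors the
# prefill-optimized deployment; a decode-dominated batch favors the
# decode-optimized one.
long_prompts = [Request(4000, 50) for _ in range(8)]
chatty = [Request(100, 1500) for _ in range(8)]
print(choose_deployment(long_prompts))  # → tp4-prefill-heavy
print(choose_deployment(chatty))        # → replica4-decode
```

In a real system the rates would come from profiling and the batch statistics from a workload predictor, so the chosen deployment can be switched as the mix drifts over time.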
Problem

Research questions and friction points this paper is trying to address.

LLM serving
spatial heterogeneity
temporal heterogeneity
workload orchestration
model deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial-temporal orchestration
heterogeneous model deployment
workload-aware scheduling
adaptive model switching
LLM serving