Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

📅 2026-05-19

🤖 AI Summary

This work addresses the drop in inference throughput that occurs when multiple large language models are deployed concurrently on shared, heterogeneous hardware under GPU memory constraints, where partial CPU-GPU offloading and preemption become necessary. Using cross-platform performance profiling, layer-wise offloading experiments, and a breakdown of preemption overhead, the study shows that decode throughput degrades non-linearly and in a model-dependent way as more layers are offloaded, with smaller models more sensitive to reduced GPU residency. It further identifies model state reload, rather than key-value cache transfer, as the dominant source of preemption overhead, and finds that this cost varies across models and hardware platforms. These findings motivate scheduler designs that account for model-specific offloading sensitivity, workload characteristics, and data movement costs in multi-model serving systems.

📝 Abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.
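To make the preemption cost structure concrete, the following is a minimal back-of-the-envelope sketch, not the paper's measurement method. It assumes that both model reload and key-value cache transfer are purely bandwidth-bound over the CPU-GPU interconnect, and it uses the standard per-token KV cache size (two tensors per layer, each of size `n_kv_heads * head_dim`). The `ModelSpec` class, the function names, the 7B-parameter example model, and the 25 GB/s effective bandwidth are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model description; all values used below are illustrative."""
    name: str
    n_params: float          # total parameter count
    n_layers: int            # number of transformer layers
    n_kv_heads: int          # key/value heads per layer
    head_dim: int            # dimension of each head
    bytes_per_elem: int = 2  # assumes fp16/bf16 weights and cache


def weight_bytes(m: ModelSpec) -> float:
    """Bytes that must be moved to reload the full model state."""
    return m.n_params * m.bytes_per_elem


def kv_cache_bytes(m: ModelSpec, seq_len: int, batch: int = 1) -> int:
    """KV cache size: K and V tensors per layer, each
    [batch, n_kv_heads, seq_len, head_dim]."""
    return (2 * m.n_layers * m.n_kv_heads * m.head_dim
            * seq_len * batch * m.bytes_per_elem)


def transfer_seconds(n_bytes: float, bandwidth_gb_s: float) -> float:
    """Idealized transfer time; ignores latency, allocation, and overlap."""
    return n_bytes / (bandwidth_gb_s * 1e9)


def preemption_cost(m: ModelSpec, seq_len: int, bandwidth_gb_s: float) -> dict:
    """Split an estimated preemption cost into reload and KV components."""
    return {
        "reload_s": transfer_seconds(weight_bytes(m), bandwidth_gb_s),
        "kv_s": transfer_seconds(kv_cache_bytes(m, seq_len), bandwidth_gb_s),
    }


# Illustrative 7B-parameter model and an assumed 25 GB/s effective bandwidth.
example = ModelSpec("example-7b", n_params=7e9, n_layers=32,
                    n_kv_heads=32, head_dim=128)
cost = preemption_cost(example, seq_len=2048, bandwidth_gb_s=25.0)
print(f"reload: {cost['reload_s']:.3f} s, KV cache: {cost['kv_s']:.3f} s")
```

Under these assumptions the weights (about 14 GB) are roughly an order of magnitude larger than the KV cache of a single 2048-token sequence (about 1 GiB), so the reload term dominates the estimate. A real scheduler would need measured costs per model and platform, since the abstract reports that these costs vary significantly across both, and that sequence length and interconnect bandwidth amplify data movement.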

Problem

Research questions and friction points this paper is trying to address.

multi-model LLM scheduling

GPU memory constraints

CPU-GPU offloading

preemption overhead

heterogeneous hardware

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-model scheduling

LLM offloading

preemption overhead