🤖 AI Summary
Deploying multiple large language models (LLMs) on shared GPU clusters suffers from severely degraded time-to-first-token (TTFT) because existing systems neglect workload periodicity, even as they improve GPU utilization. This paper proposes WarmServe, a unified GPU prewarming framework tailored for multi-LLM co-location. Its core contributions are: (1) an evict-aware model placement strategy grounded in periodic workload modeling; (2) a proactive prewarming scheduling mechanism; and (3) a zero-overhead dynamic GPU memory switching mechanism. Leveraging a one-for-many prewarming paradigm built on universal GPU workers, WarmServe enables a single prewarmed worker to serve multiple models. Evaluations demonstrate up to 50.8× TTFT improvement over a state-of-the-art autoscaling-based system and up to 2.5× higher request capacity than a GPU-sharing system.
📝 Abstract
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analysis of real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads.
We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance via proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system, while serving up to 2.5× more requests than the GPU-sharing system.
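To make the proactive-prewarming idea concrete, here is a minimal sketch (not the authors' implementation; all names and the hourly-window structure are assumptions) of how periodic workload history could drive a prewarm decision: each model's load for the next time window is predicted from the same window on previous days, and the highest-load models are prewarmed on idle universal GPU workers before their traffic arrives.

```python
# Hypothetical sketch of periodicity-driven proactive prewarming.
# Assumption: request history is kept as one list of per-window
# request counts per day, with a fixed number of windows per day.

def predict_next_load(history, window):
    """Average the request counts seen at this window index on prior days."""
    samples = [day[window] for day in history]
    return sum(samples) / len(samples) if samples else 0.0

def plan_prewarm(histories, window, free_workers):
    """Return the models to prewarm, highest predicted load first."""
    predicted = {m: predict_next_load(h, window) for m, h in histories.items()}
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return ranked[:free_workers]

# Example: two days of hourly request counts for two hypothetical models.
histories = {
    "model-a": [[10] * 24, [12] * 24],   # avg ~11 requests/window
    "model-b": [[40] * 24, [38] * 24],   # avg ~39 requests/window
}
print(plan_prewarm(histories, window=9, free_workers=1))  # ['model-b']
```

A real scheduler would also account for model load times and eviction costs (the paper's evict-aware placement), but the core input is the same: a periodic, predictable trace.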