🤖 AI Summary
Deploying multiple large language models (LLMs) on shared GPU clusters suffers from severely degraded time-to-first-token (TTFT) because existing systems neglect workload periodicity, even as they improve GPU utilization. This paper proposes WarmServe, a unified GPU prewarming framework tailored for multi-LLM co-location. Its core contributions are: (1) an evict-aware model placement strategy grounded in periodic workload modeling; (2) a proactive prewarming scheduling mechanism; and (3) a zero-overhead dynamic GPU memory switching mechanism. Leveraging a one-for-many prewarming paradigm built on universal GPU workers, WarmServe enables a single prewarmed worker to serve multiple models. Evaluations demonstrate up to 50.8× TTFT improvement over a state-of-the-art autoscaling-based system and up to 2.5× higher request capacity than a GPU-sharing system.
📝 Abstract
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of this compromise as their unawareness of future workload characteristics. In contrast, recent analysis of real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads.
We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance via proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation on real-world datasets shows that WarmServe improves TTFT by up to 50.8× compared to the state-of-the-art autoscaling-based system, while serving up to 2.5× more requests than the GPU-sharing system.
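To make the proactive-prewarming idea concrete, here is a minimal sketch (not the authors' implementation; all names and the hourly-window structure are assumptions) of how periodic workload history could drive a prewarm decision: each model's load for the next time window is predicted from the same window on previous days, and the highest-load models are prewarmed on idle universal GPU workers before their traffic arrives.

```python
# Hypothetical sketch of periodicity-driven proactive prewarming.
# Assumption: request history is kept as one list of per-window
# request counts per day, with a fixed number of windows per day.

def predict_next_load(history, window):
    """Average the request counts seen at this window index on prior days."""
    samples = [day[window] for day in history]
    return sum(samples) / len(samples) if samples else 0.0

def plan_prewarm(histories, window, free_workers):
    """Return the models to prewarm, highest predicted load first."""
    predicted = {m: predict_next_load(h, window) for m, h in histories.items()}
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return ranked[:free_workers]

# Example: two days of hourly request counts for two hypothetical models.
histories = {
    "model-a": [[10] * 24, [12] * 24],   # avg ~11 requests/window
    "model-b": [[40] * 24, [38] * 24],   # avg ~39 requests/window
}
print(plan_prewarm(histories, window=9, free_workers=1))  # ['model-b']
```

A real scheduler would also account for model load times and eviction costs (the paper's evict-aware placement), but the core input is the same: a periodic, predictable trace.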