🤖 AI Summary
To address the high eviction rate and prolonged queuing latency of low-priority (LP) jobs in GPU clusters under large language model workloads, this paper proposes an SLO-aware dynamic scheduling framework. Methodologically, it: (1) builds a lightweight, tenant-level GPU demand time-series forecasting model; (2) dynamically adjusts LP job reservation quotas based on predicted demand; and (3) introduces a priority-aware, fine-grained preemption strategy to minimize disruption to LP jobs. Deployed across a production cluster with over 10,000 GPUs, the framework reduces the LP job eviction rate by 33.0%, cuts average queuing latency by 44.1%, raises the GPU allocation rate by up to 22.8%, and yields roughly $459,715 in monthly savings. The core contribution is the first integrated design unifying tenant-level resource forecasting, elastic quota control, and low-impact preemption, enabling efficient coexistence of high-priority (HP) and LP jobs while jointly guaranteeing their SLOs.
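The three-step methodology above can be illustrated with a minimal sketch. This is not the paper's actual implementation: the EWMA forecaster, the `buffer` headroom parameter, and the progress-based victim selection are all simplifying assumptions standing in for GFS's lightweight tenant-level forecasting model, elastic spot quota, and low-impact preemption policy.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int
    priority: str      # "HP" or "LP"
    progress: float = 0.0  # fraction complete; used to pick low-impact victims

def forecast_hp_demand(history):
    """Toy forecaster: exponentially weighted moving average of past HP
    GPU demand (a stand-in for the tenant-level time-series model)."""
    alpha, est = 0.5, history[0]
    for x in history[1:]:
        est = alpha * x + (1 - alpha) * est
    return round(est)

def lp_spot_quota(capacity, hp_forecast, buffer=0):
    """Elastic quota: GPUs that LP (spot) jobs may hold this interval,
    i.e. whatever the HP forecast plus a safety buffer leaves free."""
    return max(0, capacity - hp_forecast - buffer)

def schedule_hp(running_lp, free_gpus, hp_job):
    """Admit an HP job, preempting the least-progressed LP jobs first
    so the wasted LP work (and hence disruption) is minimized."""
    needed = hp_job.gpus - free_gpus
    evicted = []
    for lp in sorted(running_lp, key=lambda j: j.progress):
        if needed <= 0:
            break
        evicted.append(lp)
        needed -= lp.gpus
    return evicted
```

For example, with 100 GPUs, a forecast of 50 HP GPUs, and a 10-GPU buffer, LP jobs would be granted a quota of 40 GPUs; when an HP job later arrives and no GPUs are free, only the LP job with the least progress is evicted.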
📝 Abstract
The surge in large language models (LLMs) has fundamentally reshaped GPU usage patterns, creating an urgent need for more efficient management strategies. While cloud providers employ spot instances to reduce costs for low-priority (LP) tasks, existing schedulers still grapple with high eviction rates and lengthy queuing times. To address these limitations, we present GFS, a novel preemptive scheduling framework that enhances service-level objective (SLO) compliance for high-priority (HP) tasks while minimizing preemptions of LP tasks. First, GFS utilizes a lightweight forecasting model that predicts GPU demand across tenants, enabling proactive resource management. Second, GFS employs a dynamic allocation mechanism to adjust the spot quota for LP tasks with guaranteed durations. Last, GFS incorporates a preemptive scheduling policy that prioritizes HP tasks while minimizing the impact on LP tasks. We demonstrate the effectiveness of GFS through both real-world deployment and simulation. The results show that GFS reduces eviction rates by 33.0% and cuts queuing delays by 44.1% for LP tasks. Furthermore, GFS improves the GPU allocation rate by up to 22.8% in real production clusters. In a production cluster of more than 10,000 GPUs, GFS yields roughly $459,715 in monthly benefits.