🤖 AI Summary
To address head-of-line blocking caused by mixed latency-sensitive (LS) and best-effort (BE) workloads in datacenter MoE model inference, this paper proposes a fine-grained, priority-aware preemptive scheduling mechanism. The approach enables dynamic expert-level preemption, breaking the constraints of conventional iteration-level FCFS scheduling, and introduces a priority-driven real-time preemption algorithm, execution-state snapshotting and restoration, and a lightweight, modular runtime compatible with Hugging Face. Experiments on an NVIDIA A100 show that the method reduces average LS time-to-first-token (TTFT) by 65.5×, meets the SLO at up to 7 requests/sec (a load at which the baseline fails entirely), cuts LS turnaround time by up to 12.8×, and leaves BE throughput unchanged.
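To make the scheduling idea concrete, here is a minimal sketch of priority-driven preemption at layer granularity. All names (`Job`, `run_layer`, `PriorityScheduler`) are illustrative assumptions, not QLLM's actual API; the point is that re-checking a priority queue after every layer, rather than after every full iteration, is what lets a newly arrived LS job preempt a running BE job mid-forward-pass.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                  # lower value = higher priority (LS < BE)
    arrival: float                 # tiebreaker: earlier arrivals first
    job_id: int = field(compare=False)
    next_layer: int = field(default=0, compare=False)

def run_layer(job: Job) -> None:
    # Placeholder for executing one MoE layer's experts for `job`;
    # a real runtime would dispatch the selected experts on the GPU here.
    pass

class PriorityScheduler:
    """Hypothetical expert-level preemptive scheduler (not QLLM's code)."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self.ready: list[Job] = []  # min-heap keyed on (priority, arrival)

    def submit(self, job: Job) -> None:
        heapq.heappush(self.ready, job)

    def step(self) -> None:
        # Run exactly one layer of the highest-priority job, then requeue it.
        # Each requeue is a preemption point: if an LS job arrived while a
        # BE job was running, the LS job wins the next pop. Under pure
        # iteration-level FCFS, the LS job would instead wait for the BE
        # job's entire forward pass, causing head-of-line blocking.
        if not self.ready:
            return
        job = heapq.heappop(self.ready)
        run_layer(job)
        job.next_layer += 1
        if job.next_layer < self.num_layers:
            heapq.heappush(self.ready, job)
```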
📝 Abstract
Large Language Models have revolutionized natural language processing, yet serving them efficiently in data centers remains challenging due to mixed workloads comprising latency-sensitive (LS) and best-effort (BE) jobs. Existing inference systems employ iteration-level first-come-first-served scheduling, causing head-of-line blocking when BE jobs delay LS jobs. We introduce QLLM, a novel inference system designed for Mixture of Experts (MoE) models, featuring a fine-grained, priority-aware preemptive scheduler. QLLM enables expert-level preemption, deferring BE job execution while minimizing LS time-to-first-token (TTFT). Our approach removes iteration-level scheduling constraints, enabling the scheduler to preempt jobs at any layer based on priority. Evaluations on an NVIDIA A100 GPU show that QLLM significantly improves performance. It reduces LS TTFT by an average of $65.5\times$ and meets the SLO at up to $7$ requests/sec, whereas the baseline fails to do so under the tested workload. Additionally, it cuts LS turnaround time by up to $12.8\times$ without impacting throughput. QLLM is modular, extensible, and seamlessly integrates with Hugging Face MoE models.
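Deferring a BE job mid-pass only works if its partial progress can be resumed cheaply. The sketch below illustrates the snapshot-and-restore idea the summary mentions, under assumed state contents: the field names and the `layer(h, kv_cache=...)` call signature are placeholders, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Snapshot:
    """Assumed minimal state for resuming a preempted job without recompute."""
    job_id: int
    layer_idx: int       # index of the first layer still to be executed
    kv_cache: Any        # attention KV tensors produced by completed layers
    hidden_states: Any   # activations entering layer `layer_idx`

def preempt(job_id: int, layer_idx: int, kv_cache: Any,
            hidden_states: Any) -> Snapshot:
    # Capture execution state at a layer boundary so the GPU can be
    # handed to a higher-priority LS job immediately.
    return Snapshot(job_id, layer_idx, kv_cache, hidden_states)

def resume(snapshot: Snapshot, model: Any) -> Any:
    # Continue the forward pass from the preempted layer; earlier layers'
    # work is preserved in kv_cache/hidden_states rather than recomputed.
    h = snapshot.hidden_states
    for layer in model.layers[snapshot.layer_idx:]:   # assumed model layout
        h = layer(h, kv_cache=snapshot.kv_cache)      # assumed signature
    return h
```

Because BE jobs resume from the snapshot rather than restarting, deferral costs only queueing delay, which is consistent with the reported result that BE throughput is unaffected.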