ML Inference Scheduling with Predictable Latency

📅 2025-12-21

🤖 AI Summary

ML inference serving on GPUs suffers from unpredictable latency because concurrently executing tasks interfere with one another, which makes it hard to satisfy SLOs, meet deadlines, and keep GPU utilization high at the same time. Existing interference prediction approaches are coarse-grained and static, and they do not adapt at runtime. To address these limitations, this work proposes a fine-grained, dynamically adaptive interference prediction mechanism that combines measurement-driven extraction of co-location interference features, online workload characterization, and a lightweight dynamic prediction model, all embedded in a closed-loop scheduling framework. Experimental evaluation reports a 23.6% improvement in SLO compliance, a 41% reduction in tail-latency variability, and GPU utilization sustained above 82% under multi-model co-location. Together, these results indicate more predictable scheduling and better resource efficiency.

📝 Abstract

Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. To this end, we evaluate the potential limitations of existing interference prediction approaches and outline our ongoing work toward achieving efficient ML inference scheduling.

Problem

Research questions and friction points this paper is trying to address.

Address unpredictable latency from GPU interference in ML inference scheduling

Improve coarse-grained interference prediction by considering runtime co-location dynamics

Develop adaptive models to handle diverse workload characteristics effectively

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained interference prediction for runtime dynamics

Adaptive prediction model for varying workload characteristics

Efficient scheduling to meet latency SLOs and deadlines
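
To make the idea of an adaptive, runtime-aware predictor concrete, here is a minimal sketch in Python. It is not the authors' method: it assumes interference can be summarized as a per-co-location slowdown factor, and it swaps in a simple exponentially weighted moving average (EWMA) where the work describes a lightweight dynamic prediction model. The names `InterferencePredictor`, `admit`, `alpha`, and the example model names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InterferencePredictor:
    """Online slowdown estimate per (model, co-runner set), updated by EWMA.

    Illustrative only: the EWMA stands in for whatever adaptive model a
    real system would use.
    """
    alpha: float = 0.3  # weight of the newest observation (assumed value)
    # Slowdown factors start at 1.0, i.e. "no interference observed yet".
    slowdown: dict = field(default_factory=lambda: defaultdict(lambda: 1.0))

    @staticmethod
    def _key(model, co_runners):
        # The co-runner set is order-independent, so use a frozenset.
        return (model, frozenset(co_runners))

    def predict(self, model, solo_ms, co_runners):
        """Predicted latency = solo latency x learned slowdown factor."""
        return solo_ms * self.slowdown[self._key(model, co_runners)]

    def observe(self, model, solo_ms, co_runners, measured_ms):
        """Close the loop: blend the measured slowdown into the estimate."""
        k = self._key(model, co_runners)
        ratio = measured_ms / solo_ms
        self.slowdown[k] = (1 - self.alpha) * self.slowdown[k] + self.alpha * ratio


def admit(predictor, model, solo_ms, running, deadline_ms):
    """Co-locate a request only if its predicted latency fits the deadline."""
    return predictor.predict(model, solo_ms, running) <= deadline_ms


# Usage: feed back one measurement, then decide on the next request.
predictor = InterferencePredictor()
predictor.observe("resnet", solo_ms=10.0, co_runners=["bert"], measured_ms=18.0)
decision = admit(predictor, "resnet", solo_ms=10.0, running=["bert"], deadline_ms=15.0)
```

This sketch checks only the incoming request's deadline. A fuller scheduler would also predict how admitting the request slows down the tasks already running, and would key the estimate on finer-grained runtime features than the set of co-located model names.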
