AI Summary
ML inference serving on GPUs suffers from unpredictable latency because concurrently executing tasks interfere with one another, which makes it hard to satisfy SLOs, meet deadlines, and keep GPU utilization high at the same time. Existing interference prediction approaches are coarse-grained and static, and they do not adapt at runtime. To address these limitations, this work proposes a fine-grained, dynamically adaptive interference prediction mechanism that combines measurement-driven extraction of co-location interference features, online workload characterization, and a lightweight dynamic prediction model, all embedded within a closed-loop scheduling framework. The experimental evaluation reports a 23.6% improvement in SLO compliance, a 41% reduction in tail-latency variability, and sustained GPU utilization above 82% under multi-model co-location. Together, these results indicate improved scheduling predictability and resource efficiency.
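To make the closed-loop idea concrete, here is a minimal Python sketch of an online interference predictor feeding a deadline check. It is not the authors' method: the class `OnlineInterferencePredictor`, the function `admit`, the exponentially weighted moving average (EWMA) update, and the smoothing factor `alpha` are all hypothetical choices made for illustration. The sketch assumes that each model's solo (uncontended) latency is known and that interference can be summarized as a slowdown factor per set of co-located models.

```python
class OnlineInterferencePredictor:
    """Hypothetical EWMA slowdown predictor, keyed by (model, co-located models)."""

    def __init__(self, alpha=0.3, default_slowdown=1.0):
        self.alpha = alpha                        # weight given to the newest measurement
        self.default_slowdown = default_slowdown  # used before any measurement exists
        self._slowdown = {}                       # (model, frozenset(neighbors)) -> factor

    @staticmethod
    def _key(model, neighbors):
        return (model, frozenset(neighbors))

    def predict(self, model, neighbors, solo_latency_ms):
        """Predicted latency = solo latency times the learned slowdown factor."""
        factor = self._slowdown.get(self._key(model, neighbors), self.default_slowdown)
        return solo_latency_ms * factor

    def observe(self, model, neighbors, solo_latency_ms, measured_latency_ms):
        """Close the loop: fold a runtime measurement back into the estimate."""
        key = self._key(model, neighbors)
        sample = measured_latency_ms / solo_latency_ms
        old = self._slowdown.get(key)
        if old is None:
            self._slowdown[key] = sample
        else:
            self._slowdown[key] = (1 - self.alpha) * old + self.alpha * sample


def admit(predictor, model, neighbors, solo_latency_ms, deadline_ms):
    """Admit a request only if its predicted co-located latency meets the deadline."""
    return predictor.predict(model, neighbors, solo_latency_ms) <= deadline_ms
```

In this sketch a scheduler would call `admit` before placing a request next to the models already running on a GPU, then call `observe` with the measured latency once the request completes. A static model corresponds to never calling `observe`, so its slowdown factors would stay fixed regardless of how the workload changes.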
Abstract
Machine learning (ML) inference serving systems can schedule requests to improve GPU utilization and to meet service level objectives (SLOs) or deadlines. However, improving GPU utilization may compromise latency-sensitive scheduling, as concurrent tasks contend for GPU resources and thereby introduce interference. Given that interference effects introduce unpredictability in scheduling, neglecting them may compromise SLO or deadline satisfaction. Nevertheless, existing interference prediction approaches remain limited in several respects, which may restrict their usefulness for scheduling. First, they are often coarse-grained, which ignores runtime co-location dynamics and thus restricts their accuracy in interference prediction. Second, they tend to use a static prediction model, which may not effectively cope with different workload characteristics. To this end, we evaluate the potential limitations of existing interference prediction approaches and outline our ongoing work toward achieving efficient ML inference scheduling.