Hybrid Learning and Optimization-Based Dynamic Scheduling for DL Workloads on Heterogeneous GPU Clusters

📅 2025-12-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Scheduling deep learning jobs on heterogeneous GPU clusters is hampered by the lack of job profiling, the need for application-agnostic operation, and dynamic workload variations. Method: This paper proposes RLTune, the first offline-profiling-free, application-agnostic, real-time scheduling framework. It jointly optimizes job completion time, queueing delay, and GPU utilization through end-to-end integration of Proximal Policy Optimization (PPO) for online priority assignment with Mixed-Integer Linear Programming (MILP) for precise job-to-node mapping. Contribution/Results: RLTune establishes the first system-level co-design paradigm integrating RL policies with MILP for cluster scheduling, and it generalizes zero-shot across production traces from diverse platforms (Philly, Helios, Alibaba). Experiments demonstrate up to 20% higher GPU utilization, 81% lower queueing delay, and 70% lower job completion time. Deployment feasibility is further validated in multi-cloud environments.

📝 Abstract
Modern cloud platforms increasingly host large-scale deep learning (DL) workloads, demanding high-throughput, low-latency GPU scheduling. However, the growing heterogeneity of GPU clusters and limited visibility into application characteristics pose major challenges for existing schedulers, which often rely on offline profiling or application-specific assumptions. We present RLTune, an application-agnostic reinforcement learning (RL)-based scheduling framework that dynamically prioritizes and allocates DL jobs on heterogeneous GPU clusters. RLTune integrates RL-driven prioritization with MILP-based job-to-node mapping to optimize system-wide objectives such as job completion time (JCT), queueing delay, and resource utilization. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, RLTune improves GPU utilization by up to 20%, reduces queueing delay by up to 81%, and shortens JCT by as much as 70%. Unlike prior approaches, RLTune generalizes across diverse workloads without requiring per-job profiling, making it practical for cloud providers to deploy at scale for more efficient, fair, and sustainable DL workload management.
Problem

Research questions and friction points this paper is trying to address.

Optimizes scheduling for DL workloads on heterogeneous GPU clusters
Reduces job completion time and queueing delay without per-job profiling
Improves GPU utilization and generalizes across diverse application workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning-based dynamic scheduling for heterogeneous GPU clusters
Integrates RL prioritization with MILP-based job-to-node mapping
Generalizes across workloads without per-job profiling for scalable deployment
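The two-stage shape described above can be sketched in miniature. Everything below is an illustrative assumption, not RLTune's actual formulation: the hand-written `priority` function stands in for the learned PPO policy, and an exhaustive search over a tiny instance stands in for the MILP solve.

```python
from itertools import product

jobs = [  # (job_id, gpus_requested, queue_wait_s) -- made-up instance
    ("j1", 2, 120.0),
    ("j2", 1, 300.0),
    ("j3", 1, 30.0),
]
nodes = [  # (node_id, free_gpus, relative_speed) -- heterogeneous pool
    ("n1", 2, 1.0),  # e.g. older GPUs
    ("n2", 3, 2.0),  # e.g. newer, 2x-faster GPUs
]

def priority(job):
    """Stand-in for the learned policy: favor long-waiting, small jobs."""
    _, gpus, wait = job
    return wait / gpus

# Stage 1: order the queue by priority (the paper uses a PPO policy here).
ordered = sorted(jobs, key=priority, reverse=True)

def feasible(assign):
    """Check that no node's free GPUs are oversubscribed."""
    used = {nid: 0 for nid, _, _ in nodes}
    for (_, gpus, _), (nid, _, _) in zip(ordered, assign):
        used[nid] += gpus
    return all(used[nid] <= cap for nid, cap, _ in nodes)

def cost(assign):
    """Toy JCT proxy: each job's GPU demand divided by its node's speed."""
    return sum(gpus / speed
               for (_, gpus, _), (_, _, speed) in zip(ordered, assign))

# Stage 2: exact job-to-node mapping by exhaustive search
# (the paper solves a MILP instead; brute force only works at toy scale).
best = min((a for a in product(nodes, repeat=len(ordered)) if feasible(a)),
           key=cost)
best_cost = cost(best)
print(f"min total estimated runtime: {best_cost}")  # 2.5 for this instance
```

Swapping the exhaustive search for a real MILP solver and the heuristic score for a trained policy recovers the prioritize-then-map structure the paper describes; the point of the sketch is only that the two stages compose cleanly.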
Shruti Dongare
Virginia Tech, USA
Redwan Ibne Seraj Khan
Virginia Tech, USA
Hadeel Albahar
Kuwait University, Kuwait
Nannan Zhao
Northwestern Polytechnical University, China
Diego Meléndez-Maita
Virginia Tech, USA
Ali R. Butt
Professor of Computer Science, Virginia Tech
I/O & Storage Systems · HPC · ML/DL · Cloud Computing · Distributed Systems