GPU Cluster Scheduling for Network-Sensitive Deep Learning

📅 2024-01-29
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
📄 PDF
🤖 AI Summary
Distributed deep learning (DL) jobs are highly sensitive to communication latency, yet existing schedulers lack network awareness. To address this, we propose a network-aware GPU cluster scheduler. Our method introduces three key innovations: (1) a job latency-sensitivity–driven GPU resource colocation mechanism, the first of its kind; (2) a network-aware, fine-grained preemption strategy that dynamically prioritizes latency-critical communication; and (3) an adaptive latency timer for automatic, runtime tuning of scheduling parameters. Integrated into a data-driven distributed DL simulation platform, our scheduler achieves up to 69% reduction in end-to-end training time, 83% decrease in average job completion time, and 98% lower communication overhead under high network congestion. Furthermore, it significantly improves GPU resource utilization and overall training efficiency.

Technology Category

Application Category

📝 Abstract
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs'sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an"auto-tuner"mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide improvement of up to 69% in end-to-end Makespan for training all jobs compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and minimizing the communication overheads by up to 98% under congested networking conditions.
Problem

Research questions and friction points this paper is trying to address.

Optimizes GPU cluster scheduling for network-sensitive distributed deep learning workloads
Improves job completion time and communication efficiency in congested network conditions
Addresses DDL job sensitivities to communication-network delays through proximity-based consolidation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximity-based GPU consolidation for network-sensitive DDL
Network-aware job preemption strategy for efficient scheduling
Auto-tuner optimizes delay timers in cluster scheduling
🔎 Similar Papers
No similar papers found.
A
Aakash Sharma
Computer Science and Engineering, The Pennsylvania State University
V
Vivek M. Bhasi
Computer Science and Engineering, The Pennsylvania State University
S
Sonali Singh
Computer Science and Engineering, The Pennsylvania State University
G
G. Kesidis
Computer Science and Engineering, The Pennsylvania State University
M
M. Kandemir
Computer Science and Engineering, The Pennsylvania State University
Chita R. Das
Chita R. Das
Computer Science and Engineering, The Pennsylvania State University