🤖 AI Summary
Distributed deep learning (DL) jobs are highly sensitive to communication latency, yet existing schedulers lack network awareness. To address this, we propose a network-aware GPU cluster scheduler built on three components: (1) a delay-scheduling mechanism that consolidates GPU resources according to each job's sensitivity to anticipated network delays; (2) a network-sensitive job preemption strategy; and (3) an auto-tuner that adjusts the delay timers at runtime for effective delay scheduling. Evaluated on a data-driven distributed DL simulation platform with real-world workload traces, the scheduler reduces end-to-end makespan by up to 69%, average job completion time by up to 83%, and communication overhead by up to 98% under congested networking conditions, while improving GPU utilization and overall training efficiency.
📝 Abstract
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves the end-to-end makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
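To make the core idea of component (i) concrete, here is a minimal sketch of classical delay scheduling applied to DDL job placement: prefer a consolidated (single-node) placement, wait up to a bounded delay budget for one to free up, and only then fall back to a fragmented placement. All names (`Job`, `Node`, `place`, the round-based delay budget) are illustrative assumptions, not the paper's actual API or algorithm details.

```python
# Sketch of delay scheduling for consolidated GPU placement.
# Hypothetical types and logic; the paper's scheduler is more elaborate
# (network-sensitive preemption, auto-tuned delay timers).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int
    waited: int = 0          # scheduling rounds this job has waited so far

@dataclass
class Node:
    name: str
    free_gpus: int

def place(job, nodes, delay_budget):
    """Return a placement [(node_name, gpu_count), ...] or None if the
    job stays queued this round (still within its delay budget)."""
    # Preferred: all GPUs on one node -> no cross-node communication.
    for node in nodes:
        if node.free_gpus >= job.gpus_needed:
            node.free_gpus -= job.gpus_needed
            return [(node.name, job.gpus_needed)]
    # No consolidated slot yet: delay, hoping one frees up.
    if job.waited < delay_budget:
        job.waited += 1
        return None
    # Delay budget exhausted: accept a fragmented placement.
    placement, remaining = [], job.gpus_needed
    for node in nodes:
        take = min(node.free_gpus, remaining)
        if take:
            node.free_gpus -= take
            placement.append((node.name, take))
            remaining -= take
        if remaining == 0:
            return placement
    # Cluster cannot fit the job at all: roll back partial allocations.
    for name, take in placement:
        next(n for n in nodes if n.name == name).free_gpus += take
    return None
```

For example, a 4-GPU job on two nodes with 2 free GPUs each is held for `delay_budget` rounds and then spread across both nodes; on a node with 4 free GPUs it is consolidated immediately.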