🤖 AI Summary
Distributed deep learning (DL) jobs are highly sensitive to communication latency, yet existing schedulers lack network awareness. To address this, we propose a network-aware GPU cluster scheduler built on three components: (1) a delay-scheduling mechanism that consolidates GPU resources according to each job's sensitivity to anticipated network delays; (2) a network-sensitive job preemption strategy; and (3) an auto-tuner that adjusts the delay timers at runtime for effective delay scheduling. Evaluated on a data-driven distributed DL simulation platform with real-world workload traces, the scheduler reduces end-to-end makespan by up to 69%, average job completion time by up to 83%, and communication overhead by up to 98% under congested networking conditions, while improving GPU utilization and overall training efficiency.
📝 Abstract
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler improves the end-to-end makespan for training all jobs by up to 69% compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and the communication overheads by up to 98% under congested networking conditions.
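To make the core idea of component (i) concrete, here is a minimal sketch of classical delay scheduling applied to DDL job placement: prefer a consolidated (single-node) placement, wait up to a bounded delay budget for one to free up, and only then fall back to a fragmented placement. All names (`Job`, `Node`, `place`, the round-based delay budget) are illustrative assumptions, not the paper's actual API or algorithm details.

```python
# Sketch of delay scheduling for consolidated GPU placement.
# Hypothetical types and logic; the paper's scheduler is more elaborate
# (network-sensitive preemption, auto-tuned delay timers).
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int
    waited: int = 0          # scheduling rounds this job has waited so far

@dataclass
class Node:
    name: str
    free_gpus: int

def place(job, nodes, delay_budget):
    """Return a placement [(node_name, gpu_count), ...] or None if the
    job stays queued this round (still within its delay budget)."""
    # Preferred: all GPUs on one node -> no cross-node communication.
    for node in nodes:
        if node.free_gpus >= job.gpus_needed:
            node.free_gpus -= job.gpus_needed
            return [(node.name, job.gpus_needed)]
    # No consolidated slot yet: delay, hoping one frees up.
    if job.waited < delay_budget:
        job.waited += 1
        return None
    # Delay budget exhausted: accept a fragmented placement.
    placement, remaining = [], job.gpus_needed
    for node in nodes:
        take = min(node.free_gpus, remaining)
        if take:
            node.free_gpus -= take
            placement.append((node.name, take))
            remaining -= take
        if remaining == 0:
            return placement
    # Cluster cannot fit the job at all: roll back partial allocations.
    for name, take in placement:
        next(n for n in nodes if n.name == name).free_gpus += take
    return None
```

For example, a 4-GPU job on two nodes with 2 free GPUs each is held for `delay_budget` rounds and then spread across both nodes; on a node with 4 free GPUs it is consolidated immediately.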