🤖 AI Summary
This paper identifies CPU and network bandwidth resource imbalance—not GPU heterogeneity—as the primary cause of stragglers in deep learning training on homogeneous GPU clusters; surprisingly, common asynchronous SGD exacerbates this issue. To address it, we propose STAR, a system featuring: (1) a novel grouped-synchronous parameter update mechanism; (2) dynamic synchronization mode selection via hybrid heuristic and online ML models; and (3) coordinated, resource-aware scheduling—integrating parameter server (PS) load balancing, gradient transmission optimization, and proactive CPU/bandwidth overload avoidance—within both PS and All-reduce architectures. Experiments on AWS clusters show STAR reduces time-to-accuracy (TTA) by 48–84% under PS and 51–70% under All-reduce, while preserving convergence accuracy equivalent to synchronous SGD. The implementation is open-sourced.
📝 Abstract
Despite the popularity of deep learning (DL) training on homogeneous GPU clusters, the prevalence, causes, and impact of stragglers, and the effectiveness of existing straggler mitigation approaches, remain poorly understood in this setting due to limited research on these questions. To fill this gap, we conducted comprehensive experiments and found that stragglers remain widespread due to imbalances in CPU and bandwidth usage. Additionally, existing mitigation methods that switch from synchronous stochastic gradient descent (SSGD) to asynchronous SGD (ASGD) may not improve Time-To-Accuracy (TTA) and can even generate more stragglers due to ASGD's higher resource consumption. To address these newly found problems, we propose the Straggler Tolerant And Resilient DL training system (STAR). STAR introduces new synchronization modes that group workers for each parameter update. It uses a heuristic and an ML method to choose the synchronization mode that minimizes TTA, and reallocates resources to support the selected mode while minimizing the impact on co-located jobs. Moreover, it proactively prevents stragglers by avoiding CPU and bandwidth overload both when placing parameter servers (PSs), which consume substantial CPU and bandwidth, and during gradient transmission. Our trace-driven evaluation on AWS shows that STAR achieves 48–84% and 51–70% lower TTA than state-of-the-art systems in the PS and all-reduce architectures, respectively, while maintaining the converged accuracy of SSGD. The code for STAR is open-sourced.
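To make the grouped-synchronous idea concrete, here is a minimal toy sketch of one way such an update could work, based only on the abstract's description (the function name `grouped_sync_update`, the `groups` partition, and the single-tensor parameter model are our illustrative assumptions, not the paper's actual algorithm): workers within a group synchronize by averaging their gradients, while each group applies its averaged update to the shared parameters independently, ASGD-style across groups.

```python
import numpy as np

def grouped_sync_update(params, worker_grads, groups, lr=0.1):
    """Toy illustration of a grouped-synchronous update (our reading of
    the abstract, not the paper's implementation).

    params       -- shared model parameters (NumPy array)
    worker_grads -- mapping worker_id -> gradient array
    groups       -- partition of worker ids, e.g. [[0, 1], [2, 3]]
    """
    for group in groups:
        # Synchronous step *within* the group: average member gradients,
        # as SSGD would across all workers.
        avg_grad = np.mean([worker_grads[w] for w in group], axis=0)
        # Asynchronous step *across* groups: each group updates the
        # shared parameters as soon as its own average is ready,
        # without waiting for stragglers in other groups.
        params = params - lr * avg_grad
    return params

# Example: two groups of two workers over a 3-parameter model.
params = np.zeros(3)
worker_grads = {
    0: np.full(3, 1.0), 1: np.full(3, 3.0),  # group averages to 2.0
    2: np.full(3, 2.0), 3: np.full(3, 4.0),  # group averages to 3.0
}
new_params = grouped_sync_update(params, worker_grads, [[0, 1], [2, 3]])
```

Straight SSGD corresponds to a single group containing every worker, and pure ASGD to one group per worker; the grouped modes interpolate between the two, which is presumably what gives STAR's mode selector its search space.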