AI Summary
In distributed training, AllReduce synchronization is severely bottlenecked by persistent stragglers: slow nodes that delay global coordination and impede large-model scalability. To address this, we propose StragglAR, the first asynchronous, multi-phase AllReduce algorithm: it executes a ReduceScatter on the ready GPUs ahead of the straggler and completes the remaining synchronization via a novel pluggable collective communication mechanism. StragglAR integrates a hybrid topology scheduler (combining Ring and Halving-Doubling), dynamic barrier bypassing, optimized GPU-to-GPU peer-to-peer communication, and latency-aware operation orchestration. Theoretically, it achieves up to a 2x speedup; empirically, on an 8-GPU server, it improves AllReduce throughput by 22%, significantly reducing per-iteration latency. Crucially, StragglAR maintains high bandwidth efficiency and fault tolerance without compromising correctness or requiring hardware modifications.
Abstract
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
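The two-phase idea described above can be illustrated with a toy, single-process sketch. This is an assumption-laden model for intuition only, not the paper's implementation: it represents each GPU's tensor as a Python list, picks one rank as the persistent straggler, performs the Phase-1 ReduceScatter among the ready ranks (overlapped, in a real system, with the straggler's delay), and then folds in the straggler's contribution and allgathers the shards once it reaches the barrier. The function name and shard layout are hypothetical.

```python
def allreduce_with_straggler(tensors, straggler):
    """Toy model of a straggler-aware AllReduce (illustrative only).

    tensors: list of equal-length lists of numbers, one per GPU rank.
    straggler: rank index of the persistently slow GPU.
    Returns the per-rank results; every rank ends with the full sum.
    """
    n = len(tensors)
    length = len(tensors[0])
    ready = [r for r in range(n) if r != straggler]

    # Phase 1 -- overlapped with the straggler's delay: the ready GPUs
    # reduce-scatter among themselves, so each ready GPU owns the partial
    # sum (excluding the straggler) of one contiguous shard of the tensor.
    shard = length // len(ready)  # assume divisibility for simplicity
    owned = {}
    for k, r in enumerate(ready):
        lo = k * shard
        hi = (k + 1) * shard if k < len(ready) - 1 else length
        owned[r] = [sum(tensors[q][i] for q in ready) for i in range(lo, hi)]

    # Phase 2 -- after the straggler reaches the barrier: fold the
    # straggler's data into each owned shard, then allgather the completed
    # shards so every rank holds the fully reduced tensor.
    result = []
    for k, r in enumerate(ready):
        lo = k * shard
        result += [v + tensors[straggler][lo + i]
                   for i, v in enumerate(owned[r])]
    return [result[:] for _ in range(n)]
```

The payoff in this model is that all of Phase 1's communication happens while waiting for the straggler, so only the (cheaper) completion phase remains on the critical path once the barrier is reached.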