Accelerating AllReduce with a Persistent Straggler

📅 2025-05-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In distributed training and inference, bulk-synchronous AllReduce can be delayed by a persistent straggler: a consistently slow node that holds back the synchronization barrier required to begin the collective. To address this, the authors propose StragglAR, a straggler-aware AllReduce algorithm: the GPUs that reach the barrier first execute a ReduceScatter among themselves during the straggler-induced delay, and once the straggler arrives, a novel collective algorithm completes the AllReduce. StragglAR achieves a 2× theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) on large GPU clusters with persistent stragglers, and an implementation on an 8-GPU server yields a 22% speedup over state-of-the-art AllReduce algorithms.

๐Ÿ“ Abstract
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
Problem

Research questions and friction points this paper is trying to address.

Bulk-synchronous AllReduce is delayed by persistent stragglers that are slow to reach the synchronization barrier
Accelerating distributed training and inference in the presence of such stragglers
Improving AllReduce efficiency in large GPU clusters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent-straggler-aware AllReduce algorithm (StragglAR)
ReduceScatter among the ready GPUs during the straggler-induced delay
A novel collective algorithm that completes the AllReduce once the straggler arrives
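The two-phase idea above can be sketched in plain Python. This is a hypothetical, serialized simulation of the data flow only, not the paper's actual communication schedule or implementation: the ready ranks pre-compute a ReduceScatter among themselves while one rank straggles, and once the straggler arrives its contribution is folded in and the completed chunks are gathered to every rank. The function name `straggler_allreduce` and the one-chunk-per-rank layout are illustrative assumptions.

```python
def straggler_allreduce(data, straggler):
    """Simulate a straggler-aware AllReduce over `data[rank][chunk]`.

    Hypothetical sketch: communication is collapsed into direct sums,
    so only the two-phase structure is modeled, not the real schedule.
    """
    n = len(data)                 # number of ranks
    chunks = len(data[0])         # one chunk per rank, for simplicity
    ready = [r for r in range(n) if r != straggler]

    # Phase 1 (during the delay): ReduceScatter among the ready ranks.
    # Conceptually, ready rank c ends up owning the partial sum of
    # chunk c over all ready ranks.
    partial = [sum(data[r][c] for r in ready) for c in range(chunks)]

    # Phase 2 (straggler arrives): fold in the straggler's contribution,
    # then gather the completed chunks so every rank holds the full sum.
    total = [p + data[straggler][c] for c, p in enumerate(partial)]
    return [list(total) for _ in range(n)]
```

The result matches an ordinary AllReduce; the point of the reordering is that Phase 1 overlaps with the straggler-induced idle time instead of starting only after every rank reaches the barrier.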