AI Summary
In distributed training, AllReduce synchronization is severely bottlenecked by persistent stragglers: slow nodes that delay global coordination and impede large-model scalability. To address this, we propose StragglAR, the first asynchronous, multi-phase AllReduce algorithm: it executes a ReduceScatter on the ready GPUs ahead of the straggler and completes the remaining synchronization via a novel pluggable collective communication mechanism. StragglAR integrates a hybrid topology scheduler (combining Ring and Halving-Doubling), dynamic barrier bypassing, optimized GPU-to-GPU peer-to-peer communication, and latency-aware operation orchestration. Theoretically, it achieves up to a 2x speedup; empirically, on an 8-GPU server, it improves AllReduce throughput by 22%, significantly reducing per-iteration latency. Crucially, StragglAR maintains high bandwidth efficiency and fault tolerance without compromising correctness or requiring hardware modifications.
Abstract
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
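The two-phase idea described above can be illustrated with a toy, single-process sketch. This is an assumption-laden model for intuition only, not the paper's implementation: it represents each GPU's tensor as a Python list, picks one rank as the persistent straggler, performs the Phase-1 ReduceScatter among the ready ranks (overlapped, in a real system, with the straggler's delay), and then folds in the straggler's contribution and allgathers the shards once it reaches the barrier. The function name and shard layout are hypothetical.

```python
def allreduce_with_straggler(tensors, straggler):
    """Toy model of a straggler-aware AllReduce (illustrative only).

    tensors: list of equal-length lists of numbers, one per GPU rank.
    straggler: rank index of the persistently slow GPU.
    Returns the per-rank results; every rank ends with the full sum.
    """
    n = len(tensors)
    length = len(tensors[0])
    ready = [r for r in range(n) if r != straggler]

    # Phase 1 -- overlapped with the straggler's delay: the ready GPUs
    # reduce-scatter among themselves, so each ready GPU owns the partial
    # sum (excluding the straggler) of one contiguous shard of the tensor.
    shard = length // len(ready)  # assume divisibility for simplicity
    owned = {}
    for k, r in enumerate(ready):
        lo = k * shard
        hi = (k + 1) * shard if k < len(ready) - 1 else length
        owned[r] = [sum(tensors[q][i] for q in ready) for i in range(lo, hi)]

    # Phase 2 -- after the straggler reaches the barrier: fold the
    # straggler's data into each owned shard, then allgather the completed
    # shards so every rank holds the fully reduced tensor.
    result = []
    for k, r in enumerate(ready):
        lo = k * shard
        result += [v + tensors[straggler][lo + i]
                   for i, v in enumerate(owned[r])]
    return [result[:] for _ in range(n)]
```

The payoff in this model is that all of Phase 1's communication happens while waiting for the straggler, so only the (cheaper) completion phase remains on the critical path once the barrier is reached.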