SprayCheck: Finding Gray Failures in Adaptive Routing Networks

📅 2026-05-05

🤖 AI Summary
This work addresses gray failures in the networks that carry large-scale distributed ML training: faults that are hard to detect, localize, and debug, yet significantly degrade network and application performance. The authors propose SprayCheck, a passive detection system that analyzes the statistical patterns of traffic spraying under adaptive routing and load balancing, combined with flow-level information. Because it only observes existing traffic, it needs no active probing and adds no load to the network. SprayCheck identifies and localizes failures before they significantly impact training, which enables preemptive rerouting. In an evaluation on a 64-spine topology running Llama-3 70B training, SprayCheck detects and localizes a single-link packet drop rate of 1.5% within a single training iteration, and rates as low as 0.5% within five iterations.
📝 Abstract
Distributed machine learning (ML) training has become a dominant workload in modern data center networks, operating at massive scale with clusters comprising tens to hundreds of thousands of GPUs. The scale of these networks makes failures, and particularly gray failures, inevitable. Gray failures can significantly degrade both network and application performance, yet they are notoriously difficult to detect, localize, and debug. To meet the performance demands of ML workloads, adaptive routing is widely deployed to maximize network utilization by dynamically spreading traffic across many paths. While adaptive routing increases network utilization, it also greatly intensifies the effect of gray failures. Prior work has either dismissed gray failures as negligible or proposed detection mechanisms that fail to scale, rendering these approaches increasingly impractical for large-scale clusters. We present SprayCheck, a passive gray failure detection system that leverages the statistical properties of adaptive routing and network load balancing. By combining these properties with flow-level information, SprayCheck can identify failures before they significantly impact application performance, enabling preemptive rerouting and improving overall performance. Importantly, this is achieved through passive observation of traffic spraying, without introducing additional load on the network. We evaluate SprayCheck and show that it can detect and localize a single-link packet drop rate of $1.5\%$ within a single training iteration, and as low as $0.5\%$ within 5 training iterations, of Llama-3 70B in a 64-spine topology.
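The abstract does not spell out the detection algorithm, so the following is only a toy sketch of the general idea of passive detection from spraying statistics, not SprayCheck's actual method. It assumes, hypothetically, that every packet of a flow picks one of k paths uniformly at random and that the receiver knows how many packets were sent (the "flow-level information"). Under that assumption each per-path count is roughly binomial, and a path whose count falls far below its expected share becomes a suspect. The function names, the z-score test, and the threshold of 4 are illustrative choices, not taken from the paper.

```python
import math


def suspect_paths(received_per_path, packets_sent, z_threshold=4.0):
    """Return indices of paths whose received packet count is improbably low.

    Assumption (not from the paper): each of `packets_sent` packets picks one
    of the k paths uniformly at random, so every per-path count is
    approximately Binomial(packets_sent, 1/k).
    """
    k = len(received_per_path)
    if k < 2:
        raise ValueError("need at least two paths to compare")
    share = 1.0 / k
    expected = packets_sent * share
    std_dev = math.sqrt(packets_sent * share * (1.0 - share))
    suspects = []
    for path, received in enumerate(received_per_path):
        # Negative z-score: this path delivered fewer packets than expected.
        z = (received - expected) / std_dev
        if z < -z_threshold:
            suspects.append(path)
    return suspects


def packets_needed_per_path(loss_rate, num_paths, z_threshold=4.0):
    """Rough per-path packet count at which the expected deficit caused by
    `loss_rate` reaches the z-score threshold, under the same toy model."""
    share = 1.0 / num_paths
    return math.ceil((z_threshold / loss_rate) ** 2 * (1.0 - share))


# Hypothetical counts for 4 paths and 4,000,000 sprayed packets;
# path 3 has lost about 1.5% of its share.
counts = [1_000_000, 1_000_300, 999_800, 985_000]
print(suspect_paths(counts, packets_sent=4_000_000))  # prints [3]
```

In this toy model, `packets_needed_per_path` grows as the loss rate shrinks, because the deficit scales with the packet count while the sampling noise scales with its square root. That is qualitatively consistent with the abstract reporting more training iterations for a $0.5\%$ drop rate than for $1.5\%$, though the exact figures in the paper come from its own method and evaluation, not from this sketch.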
Problem

Research questions and friction points this paper is trying to address.

gray failures
adaptive routing
distributed machine learning
data center networks
failure detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

gray failures
adaptive routing
passive detection
distributed ML training
traffic spraying