🤖 AI Summary
This work addresses severe congestion and packet loss in cross-datacenter large-model training, caused by collective communication traffic contending with local flows at destination switches. The authors propose Spillway, a mechanism that temporarily diverts packets that congestion would otherwise drop into switch-disaggregated buffers and reinjects them once congestion subsides. Spillway operates transparently in the network and requires no modifications to end hosts or training frameworks. It employs congestion-aware buffering and reinjection policies and is validated through large-scale simulations and a hardware prototype, where it eliminates the performance degradation caused by colliding collectives and reduces training iteration time by up to 14%.
📝 Abstract
LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DCs), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination, a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse.
We present Spillway, a transparent in-network mechanism that buffers packets that would otherwise be dropped in switch-disaggregated buffers in the destination datacenter and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
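The buffer-and-drain idea described above can be illustrated with a toy model. The sketch below is not the authors' implementation: the class name `SpillwayQueue`, the `capacity` and `drain_threshold` parameters, and the simple occupancy-based reinjection rule are all assumptions made for illustration. It only shows how diverting would-be drops to a side buffer and reinjecting them later avoids loss in a single egress queue.

```python
from collections import deque


class SpillwayQueue:
    """Toy egress queue that spills would-be drops to a side buffer.

    Illustrative only: names and policy are hypothetical, not from the paper.
    """

    def __init__(self, capacity, drain_threshold):
        self.capacity = capacity                # max packets held in the egress queue
        self.drain_threshold = drain_threshold  # reinject only while occupancy is below this
        self.queue = deque()                    # the congested egress queue
        self.spill = deque()                    # stands in for the disaggregated buffer

    def enqueue(self, pkt):
        """Accept a packet; spill it instead of dropping when the queue is full."""
        # Assumption: once anything is spilled, later arrivals also spill,
        # so packets leave in arrival order.
        if self.spill or len(self.queue) >= self.capacity:
            self.spill.append(pkt)
            return "spilled"
        self.queue.append(pkt)
        return "queued"

    def tick(self):
        """Transmit one packet, then reinject spilled packets if congestion eased."""
        sent = self.queue.popleft() if self.queue else None
        while self.spill and len(self.queue) < self.drain_threshold:
            self.queue.append(self.spill.popleft())
        return sent


def run_burst(num_packets, capacity, drain_threshold):
    """Send a burst into the queue, then drain it; return (outcomes, sent order)."""
    q = SpillwayQueue(capacity, drain_threshold)
    outcomes = [q.enqueue(i) for i in range(num_packets)]
    sent = []
    while q.queue or q.spill:
        sent.append(q.tick())
    return outcomes, sent
```

For example, `run_burst(4, capacity=2, drain_threshold=2)` spills the last two packets of the burst rather than dropping them, and all four are eventually transmitted in arrival order. A real policy would also need to bound the spill buffer and decide when congestion has actually subsided, which this sketch reduces to a single occupancy threshold.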