Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering

📅 2026-05-12

🤖 AI Summary
This work addresses severe congestion and packet loss in cross-datacenter large-model training, which arise when collective communication traffic contends with local flows at destination switches. The authors propose Spillway, a mechanism that temporarily holds packets that would otherwise be dropped under congestion in disaggregated switch buffers and reinjects them once congestion subsides. Spillway operates transparently in the network and requires no modifications to end hosts or training frameworks. Using congestion-aware buffering and reinjection policies, and evaluated through large-scale simulations and a hardware prototype, Spillway eliminates the performance degradation caused by collective collisions and reduces training iteration time by up to 14%.
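The buffer-then-reinject idea in the summary can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Spillway's actual design: `ToyPort`, `reinject_below`, and `run_burst` are hypothetical names, the occupancy threshold is an invented stand-in for the congestion-aware policies, and the FIFO ordering between spilled and new packets is an assumption of the sketch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyPort:
    """Egress port with a small queue plus a hypothetical spill buffer."""
    queue_capacity: int   # packets the on-switch queue can hold
    reinject_below: int   # assumed policy: reinject while the queue is shorter than this
    queue: deque = field(default_factory=deque)
    spill: deque = field(default_factory=deque)  # stands in for the disaggregated buffer
    dropped: int = 0

    def enqueue(self, pkt, use_spill=True):
        """Accept a packet; spill it instead of dropping when the queue is full."""
        # New arrivals queue behind already-spilled packets to keep FIFO order
        # (an assumption of this sketch).
        if len(self.queue) < self.queue_capacity and not self.spill:
            self.queue.append(pkt)
        elif use_spill:
            self.spill.append(pkt)
        else:
            self.dropped += 1

    def tick(self):
        """Transmit one packet, then reinject spilled packets if congestion eased."""
        sent = self.queue.popleft() if self.queue else None
        while self.spill and len(self.queue) < self.reinject_below:
            self.queue.append(self.spill.popleft())
        return sent


def run_burst(use_spill, burst=10, capacity=4):
    """Send a burst into the port, then drain it; return (delivered, dropped)."""
    port = ToyPort(queue_capacity=capacity, reinject_below=2)
    for i in range(burst):
        port.enqueue(i, use_spill=use_spill)
    delivered = []
    for _ in range(3 * burst):  # generous number of drain ticks
        pkt = port.tick()
        if pkt is not None:
            delivered.append(pkt)
    return delivered, port.dropped
```

Calling `run_burst(True)` and `run_burst(False)` compares the same burst with and without the spill buffer: in this toy setting, the spill path delivers every packet in order, while the plain queue drops everything beyond its capacity.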
📝 Abstract
LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DCs), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination, a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers dropped packets in switch-disaggregated buffers in a destination datacenter and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
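To see why a multi-millisecond control loop matters, it can help to estimate how much data a sender emits before any congestion feedback can arrive. The sketch below is back-of-the-envelope arithmetic only: the function name `bytes_in_flight` and the example link rate and round-trip time are illustrative assumptions, not figures from the paper.

```python
def bytes_in_flight(link_gbps: float, rtt_ms: float) -> float:
    """Bytes a sender can emit at line rate during one round trip."""
    bits = link_gbps * 1e9 * (rtt_ms / 1e3)  # rate (bit/s) x time (s)
    return bits / 8


# Assumed example values: a 400 Gb/s long-haul link and a 10 ms round trip.
in_flight = bytes_in_flight(link_gbps=400, rtt_ms=10)
print(f"{in_flight / 1e6:.0f} MB sent before feedback can take effect")
```

Under these assumed values the result is on the order of hundreds of megabytes per flow, which suggests why end-to-end congestion control alone cannot prevent drops when the collision happens at the destination.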
Problem

Research questions and friction points this paper is trying to address.

cross-datacenter
collective communication
congestion
packet loss
LLM training
Innovation

Methods, ideas, or system contributions that make the work stand out.

disaggregated buffering
cross-datacenter collectives
congestion control
Spillway
LLM training