Avoiding Cross-Datacenter Collective Congestion via Disaggregated Buffering

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses severe congestion and packet loss in cross-datacenter large-model training, caused by contention between cross-datacenter collective traffic and local flows at destination switches. The authors propose Spillway, a mechanism that temporarily holds packets about to be dropped under congestion in disaggregated switch buffers and reinjects them once congestion subsides. Spillway provides this buffering transparently in the network, without modifying end hosts or training frameworks, and relies on congestion-aware buffering and reinjection policies. Evaluated through large-scale simulations and a hardware prototype, Spillway eliminates the performance degradation caused by collective collisions and reduces training iteration time by up to 14%.
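The buffer-then-reinject idea can be illustrated with a toy model. The sketch below is not the authors' implementation: the class name `SpillwayPort`, the parameters `queue_capacity` and `drain_threshold`, and the unbounded spill buffer are all assumptions made for illustration. It only shows the general pattern of diverting packets that would otherwise be dropped and draining them back once the queue has room.

```python
from collections import deque


class SpillwayPort:
    """Toy egress port: a bounded queue plus a spill buffer (hypothetical model)."""

    def __init__(self, queue_capacity, drain_threshold):
        self.queue = deque()            # on-switch egress queue
        self.spill = deque()            # stands in for the disaggregated buffer
        self.queue_capacity = queue_capacity
        self.drain_threshold = drain_threshold  # assumed reinjection policy knob

    def enqueue(self, packet):
        # A plain drop-tail switch would drop here when the queue is full;
        # this model diverts the packet to the spill buffer instead.
        if len(self.queue) < self.queue_capacity:
            self.queue.append(packet)
        else:
            self.spill.append(packet)

    def transmit(self, budget):
        # Send up to `budget` packets from the egress queue.
        sent = []
        while self.queue and len(sent) < budget:
            sent.append(self.queue.popleft())
        # Reinject spilled packets once the queue has fallen below the threshold.
        while self.spill and len(self.queue) < self.drain_threshold:
            self.queue.append(self.spill.popleft())
        return sent


port = SpillwayPort(queue_capacity=4, drain_threshold=2)
for seq in range(10):          # a burst larger than the queue
    port.enqueue(seq)

delivered = []
while port.queue or port.spill:
    delivered += port.transmit(budget=3)

print(len(delivered))          # all 10 packets arrive; none were dropped
```

In this simplified model the burst exceeds the queue by six packets, yet every packet is eventually delivered because the overflow is parked rather than discarded. A real design would also have to bound the spill buffer and decide how to handle arrivals during draining, which this sketch leaves out.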

📝 Abstract

LLM training at the scale of tens of thousands of GPUs now spans multiple datacenters (DCs), making cross-DC collectives over long-haul links unavoidable. A critical and overlooked bottleneck arises when these collectives collide with intra-DC traffic at the destination, a common pattern in real workloads. The multi-millisecond congestion control loop is too slow to react, triggering severe packet loss and congestion collapse. We present Spillway, a transparent in-network mechanism that buffers packets that would otherwise be dropped in switch-disaggregated buffers in the destination datacenter and drains them once congestion subsides. Through large-scale end-to-end simulations and a hardware prototype, we show that Spillway eliminates performance degradation from collective collisions, reducing iteration time by up to 14%, without changes to end hosts or training frameworks.
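To see why a multi-millisecond control loop matters, it helps to estimate how much data is already in flight before a sender can react. The sketch below is a back-of-the-envelope calculation, not a figure from the paper: the function name `inflight_bytes` and the example values (a 400 Gbps link, a 10 ms round trip) are assumptions chosen only for illustration.

```python
def inflight_bytes(link_gbps, rtt_ms):
    """Bytes a sender can emit during one round trip at full line rate.

    Uses integer arithmetic: bits = rate (bit/s) * RTT (s), then divide by 8.
    """
    bits = link_gbps * 10**9 * rtt_ms // 1000
    return bits // 8


# Hypothetical long-haul link: 400 Gbps with a 10 ms round-trip time.
print(inflight_bytes(400, 10))  # 500000000 bytes, i.e. 500 MB
```

Under these assumed values, roughly 500 MB arrives at the destination before any congestion signal can slow the sender, which is the kind of burst that has to be either absorbed somewhere or dropped.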

Problem

Research questions and friction points this paper is trying to address.

cross-datacenter

collective communication

congestion

packet loss

LLM training

Innovation

Methods, ideas, or system contributions that make the work stand out.

disaggregated buffering

cross-datacenter collectives

congestion control