Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

📅 2026-04-18

🤖 AI Summary
This work addresses the performance degradation in distributed AI training that arises when network jitter and congestion cause step misalignment in ring-based collective communication. The authors propose Symphony, a lightweight in-network solution that tracks per-job pipeline progress and uses congestion signals to selectively throttle outpacing flows, letting lagging flows catch up without global coordination. Simulations with Astra-Sim show that Symphony reduces job/collective communication time by up to 54%, and a prototype on an Intel Tofino2 programmable switch demonstrates its practicality on real hardware.

📝 Abstract
Ring-based collective operations are widely used in distributed AI training due to their efficient bandwidth utilization. While ring communication excels at pipelining, its performance is heavily dependent on having synchronized step-wise progression. This presents a mismatch to the underlying network conditions in practice: collective operations are vulnerable to network jitter and congestion, leading to step misalignment and increased collective completion time. To that end, we propose Symphony, an in-network solution that detects pipeline step misalignment and mitigates its impact. Symphony introduces (1) a lightweight mechanism to track per-job pipeline progress and (2) a novel use of congestion signals to selectively throttle outpacing flows, allowing lagging flows to catch up without global coordination. Through simulations using Astra-Sim, we show that Symphony effectively mitigates step misalignments in ring-based collectives, resulting in up to 54% improvement in job/collective communication time. Finally, we prototype and validate Symphony on an Intel Tofino2 programmable switch to demonstrate its practicality.
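
The two ideas named in the abstract, per-job progress tracking and selective throttling of outpacing flows, can be illustrated with a toy sketch. This is not Symphony's actual algorithm, which the abstract does not spell out. It assumes, hypothetically, that each packet exposes a flow identifier and a ring step number, and that a flow counts as "outpacing" when it is ahead of the slowest flow of the same job by more than a configurable `slack`. The names `JobProgress`, `observe`, `should_mark`, and `slack` are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class JobProgress:
    """Hypothetical per-job state: the latest ring step seen for each flow."""
    latest_step: dict = field(default_factory=dict)

    def observe(self, flow_id: str, step: int) -> None:
        # Keep the highest step number seen so far for this flow.
        self.latest_step[flow_id] = max(step, self.latest_step.get(flow_id, 0))

    def should_mark(self, flow_id: str, slack: int = 0) -> bool:
        # A flow is treated as outpacing if it leads the slowest flow
        # of the job by more than `slack` steps (assumed policy).
        if not self.latest_step:
            return False
        slowest = min(self.latest_step.values())
        return self.latest_step.get(flow_id, slowest) - slowest > slack


# Three flows of one job, observed at different ring steps.
progress = JobProgress()
for flow, step in [("a", 3), ("b", 1), ("c", 2)]:
    progress.observe(flow, step)

# Flows that would receive a congestion signal under this toy policy.
marked = sorted(f for f in progress.latest_step if progress.should_mark(f))
marked_with_slack = sorted(
    f for f in progress.latest_step if progress.should_mark(f, slack=1)
)
print(marked, marked_with_slack)
```

In this sketch only the flows ahead of the slowest one are marked, so the lagging flow is never slowed further. A real data-plane implementation would be constrained by switch memory and pipeline stages, which this Python model ignores.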
Problem

Research questions and friction points this paper is trying to address.

ring-based collective operations
step misalignment
network jitter
congestion
distributed AI training
Innovation

Methods, ideas, or system contributions that make the work stand out.

ring-based collectives
step misalignment
in-network computing
congestion-aware throttling
pipeline synchronization