Bridge: Optimizing Collective Communication Schedules in Reconfigurable Networks with Reusable Subrings

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses when reconfiguration pays off in optical circuit-switched networks, where each reconfiguration pauses communication and existing strategies use optical links only for the current communication step. Targeting the All-to-All, AllReduce, Reduce-Scatter, and AllGather primitives common in AI/ML and HPC workloads, the paper introduces Bridge, a strategy built on Bruck's communication pattern. Bridge connects immediate communication partners directly and keeps future peers reachable through connected subrings, so the same optical links can be reused over several subsequent steps and the reconfiguration cost is amortized beyond a single step. This sparse reconfiguration remains effective even with millisecond-scale reconfiguration delays. In the evaluation, All-to-All completion time typically improves by 3–10× over static baselines, while AllReduce outperforms existing reconfiguration strategies by up to 1.5× and exceeds the bandwidth-optimal Ring algorithm by 1.5–6.6× on small to moderately sized workloads.
📝 Abstract
Optical circuit-switched networks have emerged as an appealing alternative to electrical fabrics as they can reconfigure the network topology at runtime, reducing communication cost and improving bandwidth utilization. Yet exploiting optical reconfigurable networks for collective communication comes with a fundamental trade-off: each reconfiguration incurs non-negligible delay, communication must pause while the fabric reconfigures, and the benefit of a new topology depends on future traffic. The central question is therefore when reconfiguration is worth its cost. While prior work has demonstrated the benefits of reconfiguration, existing strategies use optical links only to optimize the current step, without reusing them for future steps. In this paper, we present Bridge, a reconfiguration strategy for important collective communication primitives used in AI/ML and HPC applications, namely All-to-All, AllReduce, Reduce-Scatter, and AllGather. Bridge exploits the structure of Bruck's communication pattern to support efficient sparse reconfiguration. The key idea is to reduce propagation and transmission delay by directly connecting immediate communication partners while preserving efficient reachability to future peers through connected subrings. As a result, optical links can be reused across multiple subsequent steps, allowing the benefit of reconfiguration to amortize beyond a single step. Our evaluation shows that Bridge typically reduces All-to-All completion time by $3\times$ to $10\times$ over static baselines, even with millisecond-scale reconfiguration delays. For AllReduce, Bridge uniformly outperforms existing reconfiguration strategies, delivers up to a $1.5\times$ speedup, and exceeds the bandwidth-optimal Ring algorithm by $1.5\times$ to $6.6\times$ on small to moderately sized workloads.
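To make the subring idea concrete, the following is a minimal Python sketch of the standard Bruck pattern, in which rank $i$ sends to rank $(i + 2^k) \bmod n$ at step $k$. It is not Bridge's actual scheduling algorithm, which the abstract does not specify. The sketch assumes, purely for illustration, that a reconfiguration gives every rank one direct link to its current peer at stride $2^k$; such links split the ranks into $\gcd(n, 2^k)$ disjoint subrings, and a later peer at stride $2^{k+j}$ is then $2^j$ hops away on the same subring. The function names `bruck_peers`, `stride_subrings`, and `hops_on_subring` are hypothetical.

```python
from typing import List, Optional


def bruck_peers(rank: int, n: int) -> List[int]:
    """Send targets of `rank` in Bruck's pattern: (rank + 2^k) mod n per step k."""
    peers = []
    stride = 1
    while stride < n:
        peers.append((rank + stride) % n)
        stride *= 2
    return peers


def stride_subrings(n: int, stride: int) -> List[List[int]]:
    """Rings formed when every rank i has a direct link to (i + stride) mod n.

    Assumption for illustration: one outgoing link per rank, all links at
    the same stride. This is not taken from the paper.
    """
    rings = []
    seen = set()
    for start in range(n):
        if start in seen:
            continue
        ring = []
        node = start
        # Follow stride links until the walk closes back on a visited rank.
        while node not in seen:
            seen.add(node)
            ring.append(node)
            node = (node + stride) % n
        rings.append(ring)
    return rings


def hops_on_subring(src: int, dst: int, n: int, stride: int) -> Optional[int]:
    """Hops from src to dst using stride links only; None if dst is unreachable."""
    node, hops = src, 0
    while node != dst:
        node = (node + stride) % n
        hops += 1
        if node == src:  # walked the whole subring without meeting dst
            return None
    return hops


if __name__ == "__main__":
    n = 8
    print("Bruck peers of rank 0:", bruck_peers(0, n))
    print("Subrings at stride 2:", stride_subrings(n, 2))
    print("Hops from 0 to 4 on stride-2 links:", hops_on_subring(0, 4, n, 2))
```

With $n = 8$ and stride 2, the ranks split into the subrings [0, 2, 4, 6] and [1, 3, 5, 7], and rank 0 reaches its next Bruck peer, rank 4, in two hops without a further reconfiguration. This is the kind of link reuse across steps the abstract describes; how Bridge chooses when to reconfigure and which links to keep is left to the paper.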
Problem

Research questions and friction points this paper is trying to address.

reconfigurable networks
collective communication
optical circuit switching
topology reconfiguration
communication scheduling
Innovation

Methods, ideas, or system contributions that make the work stand out.

reconfigurable networks
collective communication
optical circuit switching
subring reuse
Bruck's algorithm