🤖 AI Summary
This work addresses the all-to-all communication bottleneck in distributed machine learning and high-performance computing over reconfigurable optical networks by proposing ReTri, a novel scheme that co-optimizes communication patterns and network reconfiguration strategies. ReTri innovatively adapts the Bruck algorithm to reconfigurable architectures through a bidirectional pairwise exchange mechanism based on balanced ternary block propagation, completing all-to-all communication in ⌈log₃n⌉ phases. It further amortizes reconfiguration overhead by reusing topological states across communication phases. Experimental results demonstrate that ReTri achieves up to 10× speedup over static network approaches and improves performance by up to 2.1× compared to existing reconfigurable Bruck-based methods.
📝 Abstract
All-to-All communication is a key performance bottleneck for distributed machine learning (ML) and high-performance computing (HPC) workloads, where dense traffic increasingly stresses scale-up interconnects. While these ML and HPC workloads have driven unprecedented infrastructure demand, optical reconfigurable networks (ORNs) offer a promising path forward. By adapting the physical topology to the active workload, they reduce communication cost and improve bandwidth utilization. However, their benefit is critically contingent on whether the collective consists of structured phases that can be served by sparse and reusable topology states.
In this paper, we revisit Bruck's All-to-All implementation and demonstrate the benefits of topology optimization in which both communication pattern and reconfiguration strategy are co-designed. We present ReTri, a bidirectional All-to-All schedule for ORNs. ReTri uses balanced ternary block propagation to complete All-to-All in $\lceil \log_3 n\rceil$ phases. The reconfiguration strategy induced by ReTri's pairwise bidirectional exchanges allows reconfiguration delays to be amortized across multiple phases. Preliminary simulations show that ReTri improves completion time by up to $10\times$ over static All-to-All, even for millisecond-scale reconfiguration delays, and improves on reconfigurable Bruck by up to $2.1\times$.
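To illustrate why balanced ternary digits are enough to reach every destination in $\lceil \log_3 n\rceil$ phases, the following is a minimal Python sketch. It is not the authors' schedule, which the abstract does not spell out. It assumes that in phase $k$ a block moves by $+3^k$, by $-3^k$, or stays put, according to the $k$-th balanced ternary digit of its signed ring offset. The function names `num_phases`, `balanced_ternary`, and `all_blocks_delivered` are hypothetical.

```python
def num_phases(n: int) -> int:
    """Smallest k with 3**k >= n, i.e. ceil(log3 n), using integer arithmetic."""
    k, reach = 0, 1
    while reach < n:
        reach *= 3
        k += 1
    return k


def balanced_ternary(s: int, k: int) -> list[int]:
    """Return k digits in {-1, 0, 1}, least significant first, summing to s."""
    digits = []
    for _ in range(k):
        r = s % 3          # Python's % yields 0, 1, or 2 even for negative s
        if r == 2:
            r = -1         # represent 2 as -1 with a carry into the next digit
        digits.append(r)
        s = (s - r) // 3
    if s != 0:
        raise ValueError("offset does not fit in k balanced ternary digits")
    return digits


def all_blocks_delivered(n: int) -> bool:
    """Assumed routing rule: in phase k a block hops by digit_k * 3**k on a ring.

    Checks that every (src, dst) block arrives after num_phases(n) phases.
    """
    k = num_phases(n)
    for src in range(n):
        for dst in range(n):
            d = (dst - src) % n
            # Pick the signed offset with the smaller magnitude (go either way
            # around the ring), which is what makes the exchange bidirectional.
            s = d if d <= n // 2 else d - n
            pos = src
            for phase, digit in enumerate(balanced_ternary(s, k)):
                pos = (pos + digit * 3**phase) % n
            if pos != dst:
                return False
    return True
```

Under these assumptions, `num_phases(64)` is 4, whereas a radix-2 Bruck schedule needs $\lceil \log_2 64\rceil = 6$ phases. The sketch checks reachability only; it does not model block aggregation, link bandwidth, or reconfiguration delay.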