🤖 AI Summary
Large-scale AI training strains cluster networks in bandwidth, cost, and scalability. Existing optical circuit-switching solutions are hindered by expensive, high-radix switches with slow reconfiguration times. To overcome these limitations, the paper proposes a co-designed, reconfigurable optical interconnect architecture built from low-cost, low-radix optical switch arrays tailored to the communication patterns of AI training workloads. The architecture enables dynamic topology reconfiguration, load-aware adaptation, and fault tolerance, decoupling cost from port count and aligning it instead with functional requirements. Simulations show that this approach matches the performance of fully connected packet-switched networks when training state-of-the-art large language models, while substantially reducing hardware costs and offering superior bandwidth scalability.
📝 Abstract
Machine learning training places immense demands on cluster networks, motivating specialized architectures and co-design with parallelization strategies. Recent designs incorporating optical circuit switches (OCSes) are promising, offering better cost, power efficiency, and long-term bandwidth scaling than packet switches. However, most existing approaches rely on costly high-radix OCSes and/or combine them with packet switches to achieve competitive performance at scale. Unfortunately, high-radix OCSes are both expensive and slow to reconfigure, limiting scalability and performance alike.
We propose Arrays of Cheap Optical Switches (ACOS), which bring application co-design directly to the structure of the reconfigurable fabric. Using low-radix OCSes as building blocks, ACOS supports the forms of reconfiguration needed in training clusters, including topology selection, workload adaptation, and failure resilience. The cost of ACOS scales with the supported topologies and adaptations rather than with port count, overcoming the scalability barriers of current specialized ML networks. We show through simulation that ACOS-based deployments match the performance of fully provisioned packet-switched networks when training state-of-the-art LLMs at scale, while delivering significant cost savings with existing off-the-shelf OCSes, and offering strong bandwidth scaling and even greater cost savings in the future.
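To give intuition for the abstract's cost claim, here is a minimal back-of-the-envelope sketch. It contrasts the crosspoint count of a single monolithic N×N crossbar OCS (which grows roughly quadratically with port count, a standard property of crossbars) with an array of small k×k switches sized only to the reconfiguration degree each port actually needs. The `radix` and `degree` values, the helper names, and the linear-in-degree array model are all illustrative assumptions, not figures or formulas from the paper.

```python
# Illustrative cost model (assumptions, not results from the paper):
# a monolithic N x N crossbar needs ~N^2 crosspoints, while an array of
# low-radix k x k switches needs only enough switches to give each of
# the N ports `degree` selectable configurations.

def crossbar_crosspoints(n_ports: int) -> int:
    """Crosspoint count for one monolithic n_ports x n_ports crossbar."""
    return n_ports * n_ports

def array_crosspoints(n_ports: int, radix: int, degree: int) -> int:
    """Crosspoint count for an array of radix x radix switches,
    assuming each port needs `degree` selectable configurations."""
    switches = -(-(n_ports * degree) // radix)  # ceiling division
    return switches * radix * radix

for n in (256, 1024, 4096):
    big = crossbar_crosspoints(n)
    small = array_crosspoints(n, radix=4, degree=3)
    print(f"N={n}: crossbar={big}, low-radix array={small}, "
          f"ratio={big / small:.0f}x")
```

Under these toy parameters the array's cost grows linearly with port count (for a fixed number of supported adaptations), while the crossbar's grows quadratically, which is the scaling distinction the abstract draws.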