AI Summary
To address the poor scalability and high bandwidth cost of conventional interconnects (e.g., Fat-Tree, direct topologies) in training ultra-large-scale LLMs, this paper proposes RailX, a reconfigurable flattened network architecture integrating intra-node direct connections with inter-node optical circuit switching. Its core innovation lies in the first application of Hamiltonian decomposition theory to design a ring-based all-to-all topology, simultaneously optimizing both ring-based collective communication and all-to-all traffic while enabling dynamic fault avoidance and multi-task mapping. Implemented via a 2D physical layout with optical circuit switches, RailX achieves a hop count of only 2-4 at scales exceeding 100,000 chips. It reduces All-Reduce bandwidth cost by over 90% compared to Fat-Tree and cuts All-to-All cost to less than half. For a 200,000-chip system with 1.8 TB/s aggregate interconnect bandwidth, RailX's total cost is approximately $1.3 billion.
Abstract
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architectures are neither scalable nor cost-effective enough. Tree-based topologies such as the Rail-optimized network are extremely expensive, while direct topologies such as Torus have insufficient bisection bandwidth and flexibility. In this paper, we propose RailX, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit-switching networks. We propose a novel interconnection method based on Hamiltonian Decomposition theory to organize separate rail-based rings into an all-to-all topology, simultaneously optimizing ring-collective and all-to-all communication. More than 100K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only 2~4 inter-node hops. The network cost per injection/All-Reduce bandwidth of RailX is less than 10% of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than 50% of the Fat-Tree. Specifically, only ~$1.3B is required to interconnect 200K chips with 1.8 TB/s bandwidth. RailX can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.
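The Hamiltonian Decomposition idea rests on a classical result: the complete graph K_n on an odd number n of vertices splits into (n-1)/2 edge-disjoint Hamiltonian cycles (Walecki's construction). The sketch below is not from the paper; it is a minimal illustration of that underlying graph theory, with the hypothetical helper name `walecki_cycles`, showing how separate rings can jointly cover every node pair exactly once, which is the property that lets ring-based rails double as an all-to-all topology:

```python
def walecki_cycles(n):
    """Decompose K_n (n odd, n >= 3) into (n-1)//2 edge-disjoint Hamiltonian cycles.

    Vertices 0..n-2 sit on a circle; vertex n-1 acts as the hub.
    Cycle i visits the hub, then zigzags: i, i+1, i-1, i+2, i-2, ..., i+m
    (indices mod n-1), which is Walecki's classical construction.
    """
    assert n % 2 == 1 and n >= 3
    m = (n - 1) // 2
    hub = n - 1
    # Zigzag offsets: 0, +1, -1, +2, -2, ..., +(m-1), -(m-1), +m
    offsets = [0]
    for d in range(1, m):
        offsets += [d, -d]
    offsets.append(m)
    cycles = []
    for i in range(m):
        path = [(i + o) % (n - 1) for o in offsets]
        cycles.append([hub] + path)  # each list closes back to its first vertex
    return cycles

# Demo: 4 rings over 9 nodes; together they use each of the 36 node pairs once.
rings = walecki_cycles(9)
covered = set()
for ring in rings:
    for a, b in zip(ring, ring[1:] + ring[:1]):
        covered.add(frozenset((a, b)))
print(len(rings), len(covered))  # 4 rings covering 9*8/2 = 36 distinct pairs
```

Each ring here is a candidate "rail": collectives run along individual rings, while the edge-disjoint union gives every pair of nodes a direct link somewhere, matching the paper's goal of serving ring-collective and all-to-all traffic with one physical layout.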