AI Summary
To address the poor scalability and high bandwidth cost of conventional interconnects (e.g., Fat-Tree, direct topologies) in training ultra-large-scale LLMs, this paper proposes RailX, a reconfigurable flattened network architecture integrating intra-node direct connections with inter-node optical circuit switching. Its core innovation lies in the first application of Hamiltonian decomposition theory to design a ring-based all-to-all topology, simultaneously optimizing both ring-based collective communication and all-to-all traffic while enabling dynamic fault avoidance and multi-task mapping. Implemented via a 2D physical layout with optical circuit switches, RailX achieves a hop count of only 2-4 at scales exceeding 100,000 chips. It reduces All-Reduce bandwidth cost by over 90% compared to Fat-Tree and cuts All-to-All cost to less than half. For a 200,000-chip system with 1.8 TB/s aggregate interconnect bandwidth, RailX's total cost is approximately $1.3 billion.
Abstract
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architectures are neither scalable nor cost-effective enough. Tree-based topologies such as the Rail-optimized network are extremely expensive, while direct topologies such as Torus have insufficient bisection bandwidth and flexibility. In this paper, we propose RailX, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit-switching networks. We propose a novel interconnection method based on Hamiltonian Decomposition theory to organize separate rail-based rings into an all-to-all topology, simultaneously optimizing ring-collective and all-to-all communication. More than 100K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only 2~4 inter-node hops. The network cost per injection/All-Reduce bandwidth of RailX is less than 10% of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than 50% of the Fat-Tree. Specifically, only ~$1.3B is required to interconnect 200K chips with 1.8 TB/s bandwidth. RailX can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.
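The Hamiltonian Decomposition idea rests on a classical result: the complete graph K_n on an odd number n of vertices splits into (n-1)/2 edge-disjoint Hamiltonian cycles (Walecki's construction). The sketch below is not from the paper; it is a minimal illustration of that underlying graph theory, with the hypothetical helper name `walecki_cycles`, showing how separate rings can jointly cover every node pair exactly once, which is the property that lets ring-based rails double as an all-to-all topology:

```python
def walecki_cycles(n):
    """Decompose K_n (n odd, n >= 3) into (n-1)//2 edge-disjoint Hamiltonian cycles.

    Vertices 0..n-2 sit on a circle; vertex n-1 acts as the hub.
    Cycle i visits the hub, then zigzags: i, i+1, i-1, i+2, i-2, ..., i+m
    (indices mod n-1), which is Walecki's classical construction.
    """
    assert n % 2 == 1 and n >= 3
    m = (n - 1) // 2
    hub = n - 1
    # Zigzag offsets: 0, +1, -1, +2, -2, ..., +(m-1), -(m-1), +m
    offsets = [0]
    for d in range(1, m):
        offsets += [d, -d]
    offsets.append(m)
    cycles = []
    for i in range(m):
        path = [(i + o) % (n - 1) for o in offsets]
        cycles.append([hub] + path)  # each list closes back to its first vertex
    return cycles

# Demo: 4 rings over 9 nodes; together they use each of the 36 node pairs once.
rings = walecki_cycles(9)
covered = set()
for ring in rings:
    for a, b in zip(ring, ring[1:] + ring[:1]):
        covered.add(frozenset((a, b)))
print(len(rings), len(covered))  # 4 rings covering 9*8/2 = 36 distinct pairs
```

Each ring here is a candidate "rail": collectives run along individual rings, while the edge-disjoint union gives every pair of nodes a direct link somewhere, matching the paper's goal of serving ring-collective and all-to-all traffic with one physical layout.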