🤖 AI Summary
To address the urgent demand for high-bandwidth, low-latency interconnects in large language model (LLM) training, this paper proposes UB-Mesh, a hierarchical, localized nD-FullMesh datacenter network architecture. It introduces: (1) a modular UB-Mesh-Pod design based on a 4D-FullMesh topology; (2) a Unified Bus (UB) that enables dynamic I/O bandwidth allocation and hardware resource pooling; (3) All-Path-Routing (APR) for multi-path forwarding, together with a 64+1 redundancy mechanism for fault tolerance; and (4) topology-aware scheduling combined with short-distance direct links to improve data locality. Experimental evaluation shows that UB-Mesh achieves 2.04× higher cost-efficiency than conventional Clos networks, 7.2% higher network availability, and more than 95% linearity in LLM training. These results indicate that UB-Mesh supports scalable, reliable, and cost-efficient distributed AI training.
📝 Abstract
As Large Language Models (LLMs) continue to scale, the computational power and bandwidth they require escalate. To address this, we introduce UB-Mesh, a novel AI datacenter network architecture designed to enhance scalability, performance, cost-efficiency and availability. Unlike traditional datacenters that provide symmetrical node-to-node bandwidth, UB-Mesh employs a hierarchically localized nD-FullMesh network topology. This design fully leverages the data locality of LLM training, prioritizing short-range, direct interconnects to minimize data movement distance and reduce switch usage. Although UB-Mesh's nD-FullMesh topology offers several theoretical advantages, its concrete architecture design, physical implementation and networking system optimization present new challenges. For the actual construction of UB-Mesh, we first design the UB-Mesh-Pod architecture, which is based on a 4D-FullMesh topology. UB-Mesh-Pod is implemented via a suite of hardware components that serve as the foundational building blocks, including specially designed NPU, CPU, Low-Radix-Switch (LRS), High-Radix-Switch (HRS), NICs and others. These components are interconnected via a novel Unified Bus (UB) technique, which enables flexible IO bandwidth allocation and hardware resource pooling. For networking system optimization, we propose an advanced routing mechanism named All-Path-Routing (APR) to efficiently manage data traffic. These optimizations, combined with topology-aware performance enhancements and robust reliability measures such as a 64+1 backup design, result in 2.04x higher cost-efficiency and 7.2% higher network availability than a traditional Clos architecture, along with 95%+ linearity in various LLM training tasks.
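To make the nD-FullMesh idea more concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that an nD-FullMesh places nodes on an n-dimensional grid and directly links any two nodes that differ in exactly one coordinate, so each dimension forms a full mesh. The dimension sizes `(2, 3, 4)` and the helper names `neighbors`, `link_count`, and `hop_distance` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import prod


def neighbors(node, dims):
    """Yield every node that differs from `node` in exactly one coordinate.

    Assumption: this is what a direct link means in an nD-FullMesh.
    """
    for axis, size in enumerate(dims):
        for value in range(size):
            if value != node[axis]:
                yield node[:axis] + (value,) + node[axis + 1:]


def link_count(dims):
    """Count undirected links: each node has sum(d - 1) direct neighbors."""
    return prod(dims) * sum(d - 1 for d in dims) // 2


def hop_distance(a, b):
    """Minimum hop count: one hop per coordinate in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical dimension sizes, not taken from the paper.
dims = (2, 3, 4)
nodes = list(product(*(range(d) for d in dims)))

# Cross-check the closed-form link count against a brute-force enumeration.
brute_force_links = sum(len(list(neighbors(n, dims))) for n in nodes) // 2
```

Under this assumption, a node reaches any other node in at most n hops, and nodes that share most coordinates are one hop apart without traversing a switch hierarchy, which is the locality property the abstract emphasizes.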