Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high communication overhead of Mixture-of-Experts (MoE) LLM inference, which has pushed industry toward expensive, high-bandwidth scale-up interconnects. The paper presents what its authors describe as the first systematic cross-layer analysis of network cost-effectiveness for MoE serving, comparing four XPU interconnect topologies: scale-up, scale-out, 3D torus, and 3D full-mesh. Lower-cost switchless topologies are 20.6% to 56.2% more cost-effective than the scale-up topology across all serving scenarios explored, and the 3D full-mesh topology in particular is Pareto-optimal in the performance-cost tradeoff. The study also finds that current scale-up link bandwidths are over-provisioned: reducing link bandwidth improves throughput per cost by up to 27%.

📝 Abstract

Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of the runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies: scale-up, scale-out, 3D torus, and 3D full-mesh. We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6% to 56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
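
The two metrics the abstract relies on, throughput per cost and Pareto optimality over the performance-cost tradeoff, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Design` class, the helper names, and especially the numbers are placeholders and are not taken from the paper. A design is treated as Pareto-optimal when no other design has at least as much throughput at no greater cost while being strictly better on one of the two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Hypothetical network design point (illustrative only)."""
    name: str
    throughput: float  # serving throughput, arbitrary units
    cost: float        # total system cost, arbitrary units


def cost_effectiveness(d: Design) -> float:
    """Throughput per unit cost."""
    return d.throughput / d.cost


def pareto_front(designs: list[Design]) -> list[Design]:
    """Return the designs that no other design dominates."""
    front = []
    for d in designs:
        dominated = any(
            o.throughput >= d.throughput
            and o.cost <= d.cost
            and (o.throughput > d.throughput or o.cost < d.cost)
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front


# Placeholder values, NOT results from the paper.
designs = [
    Design("design_a", throughput=100.0, cost=10.0),
    Design("design_b", throughput=90.0, cost=6.0),
    Design("design_c", throughput=80.0, cost=7.0),
    Design("design_d", throughput=60.0, cost=4.0),
]

for d in pareto_front(designs):
    print(d.name, round(cost_effectiveness(d), 2))
```

In this toy data, `design_c` is dominated by `design_b` (lower throughput at higher cost), so it drops off the front. Note that the design with the highest raw throughput is not necessarily the one with the best throughput per cost, which is the distinction the abstract draws between scale-up and switchless topologies.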

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

LLM serving

network topology

cost-effectiveness

communication overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

network topology

cost-effectiveness