🤖 AI Summary
To address high inter-node communication overhead and load imbalance in multi-server cluster deployments of Mixture-of-Experts (MoE) large language models, this paper proposes a topology-aware expert placement optimization method that jointly exploits network topology and expert activation patterns. For the first time, it integrates the physical cluster topology with MoE’s inherent sparse activation and heterogeneous expert computational loads into a tractable integer linear programming (ILP) formulation to achieve optimal expert-to-server assignment. Evaluated on DeepSeekMoE-16B and DeepSeek-R1-671B, the method reduces cross-node communication by up to 42%, improves GPU utilization, and increases end-to-end throughput—outperforming uniform or heuristic-based deployment strategies. The core contribution is a theoretically grounded, topology-aware MoE deployment framework, coupled with a scalable exact optimization approach for practical large-scale inference systems.
📝 Abstract
Efficient deployment of a pre-trained LLM on a cluster with multiple servers is a critical step toward providing fast responses to users' queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently, considering their underlying structure. During inference in MoE LLMs, only a small subset of the experts is selected to process a given token. Moreover, in practice, the experts' load is highly imbalanced. For efficient deployment, one has to distribute the model across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the placement algorithm has to take the network topology into account. This work focuses on the efficient topology-aware placement of pre-trained MoE LLMs at the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. Due to its internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that the ILP-based placement strategy yields lower network traffic than competitors for both small-scale (DeepSeekMoE-16B) and large-scale (DeepSeek-R1-671B) models.
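To make the objective concrete, here is a minimal sketch of the placement problem on a toy instance. This is not the paper's ILP formulation; instead of an ILP solver it exhaustively enumerates expert-to-server assignments and scores each by the expected number of cross-server transmissions, using hypothetical co-activation probabilities (in practice these would be estimated from routing traces). The instance sizes, the `coact` values, and the uniform capacity are all illustrative assumptions.

```python
from itertools import combinations

# Toy instance (illustrative, not from the paper): 4 experts,
# 2 servers with capacity 2 each.
# coact[(i, j)] = assumed probability that experts i and j are
# activated together for the same token.
coact = {
    (0, 1): 0.30, (0, 2): 0.05, (0, 3): 0.05,
    (1, 2): 0.05, (1, 3): 0.05, (2, 3): 0.25,
}
experts = range(4)
capacity = 2  # experts per server

def expected_transmissions(assignment):
    # A co-activated pair placed on different servers incurs one
    # inter-node transmission, weighted by its co-activation probability.
    return sum(p for (i, j), p in coact.items()
               if assignment[i] != assignment[j])

best_cost, best_assign = float("inf"), None
# Enumerate every way to pick which `capacity` experts live on server 0;
# the rest go to server 1.
for group0 in combinations(experts, capacity):
    assignment = {e: (0 if e in group0 else 1) for e in experts}
    cost = expected_transmissions(assignment)
    if cost < best_cost:
        best_cost, best_assign = cost, assignment

print(best_assign, best_cost)
```

On this instance the search co-locates the frequently co-activated pairs {0, 1} and {2, 3}, leaving only the weak cross-pairs on the inter-server link. The paper's ILP plays the same role exactly at scale, additionally weighting pairs by topology distance, where brute force is infeasible.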