🤖 AI Summary
To address high inter-node communication overhead and load imbalance in multi-server cluster deployments of Mixture-of-Experts (MoE) large language models, this paper proposes a topology-aware expert placement optimization method that jointly exploits network topology and expert activation patterns. For the first time, it integrates the physical cluster topology with MoE’s inherent sparse activation and heterogeneous expert computational loads into a tractable integer linear programming (ILP) formulation to achieve optimal expert-to-server assignment. Evaluated on DeepSeekMoE-16B and DeepSeek-R1-671B, the method reduces cross-node communication by up to 42%, improves GPU utilization, and increases end-to-end throughput—outperforming uniform or heuristic-based deployment strategies. The core contribution is a theoretically grounded, topology-aware MoE deployment framework, coupled with a scalable exact optimization approach for practical large-scale inference systems.
📝 Abstract
Efficient deployment of a pre-trained LLM on a cluster with multiple servers is a critical step toward providing fast responses to users' queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently, considering their underlying structure. During inference in MoE LLMs, only a small subset of the experts is selected to process a given token. Moreover, in practice, the experts' load is highly imbalanced. For efficient deployment, one has to distribute the model across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the placement algorithm has to take the network topology into account. This work focuses on the efficient topology-aware placement of pre-trained MoE LLMs at the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. Due to its internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that the ILP-based placement strategy yields lower network traffic than competitors for both small-scale (DeepSeekMoE-16B) and large-scale (DeepSeek-R1-671B) models.
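To make the objective concrete, here is a minimal sketch of the placement problem on a toy instance. This is not the paper's ILP formulation; instead of an ILP solver it exhaustively enumerates expert-to-server assignments and scores each by the expected number of cross-server transmissions, using hypothetical co-activation probabilities (in practice these would be estimated from routing traces). The instance sizes, the `coact` values, and the uniform capacity are all illustrative assumptions.

```python
from itertools import combinations

# Toy instance (illustrative, not from the paper): 4 experts,
# 2 servers with capacity 2 each.
# coact[(i, j)] = assumed probability that experts i and j are
# activated together for the same token.
coact = {
    (0, 1): 0.30, (0, 2): 0.05, (0, 3): 0.05,
    (1, 2): 0.05, (1, 3): 0.05, (2, 3): 0.25,
}
experts = range(4)
capacity = 2  # experts per server

def expected_transmissions(assignment):
    # A co-activated pair placed on different servers incurs one
    # inter-node transmission, weighted by its co-activation probability.
    return sum(p for (i, j), p in coact.items()
               if assignment[i] != assignment[j])

best_cost, best_assign = float("inf"), None
# Enumerate every way to pick which `capacity` experts live on server 0;
# the rest go to server 1.
for group0 in combinations(experts, capacity):
    assignment = {e: (0 if e in group0 else 1) for e in experts}
    cost = expected_transmissions(assignment)
    if cost < best_cost:
        best_cost, best_assign = cost, assignment

print(best_assign, best_cost)
```

On this instance the search co-locates the frequently co-activated pairs {0, 1} and {2, 3}, leaving only the weak cross-pairs on the inter-server link. The paper's ILP plays the same role exactly at scale, additionally weighting pairs by topology distance, where brute force is infeasible.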