🤖 AI Summary
To address the high overhead of all-to-all inter-node communication, imbalanced communication load under mesh topologies, and frequent expert migration in MoE inference on wafer-scale chips (WSCs), this paper proposes a co-designed mapping and scheduling optimization framework. It introduces Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication load across devices. Under ER-Mapping, the hot and cold links of the attention and MoE layers turn out to be complementary; the proposed Non-invasive Balancer (NI-Balancer) exploits this by splitting each expert migration into multiple phased steps that alternately use the cold links of both layers, hiding I/O latency off the critical path. In evaluation, the approach maintains full model accuracy while reducing total communication volume by up to 62% and improving MoE computation and communication efficiency by 54% and 22%, respectively; compared with an NVL72 super-node, it achieves a 39% improvement in per-device MoE inference throughput.
📝 Abstract
As large language models (LLMs) continue to scale, mixture-of-experts (MoE) has become a common technique in SOTA models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as platforms that integrate numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, making them a promising host for MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disks puts high-overhead expert migration on the critical path.
To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%, and NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication efficiency, respectively. Compared with the SOTA NVL72 super-node, the WSC platform delivers an average 39% higher per-device MoE performance owing to its scalability to larger EP degrees.
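The phased-migration idea behind NI-Balancer can be caricatured in a few lines: split one expert's weights into chunks and schedule consecutive chunks onto the links that are cold in alternating phases. This is a minimal illustrative sketch, not the paper's implementation; the chunk size, link names, and round-robin policy here are assumptions for illustration only.

```python
# Illustrative sketch of phased expert migration (NI-Balancer-style).
# All names and sizes below are hypothetical, not from the paper.

def plan_migration(expert_bytes: int, chunk_bytes: int):
    """Split one expert migration into (offset, size) chunks, each small
    enough to fit into a single idle-link window."""
    chunks, offset = [], 0
    while offset < expert_bytes:
        size = min(chunk_bytes, expert_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

def schedule(chunks):
    """Alternate chunks between links that are cold during the attention
    phase and links that are cold during the MoE phase, so migration
    traffic never rides a hot link."""
    phases = ("attention_cold_link", "moe_cold_link")
    return [(phases[i % 2], chunk) for i, chunk in enumerate(chunks)]

plan = schedule(plan_migration(expert_bytes=10, chunk_bytes=4))
```

With a 10-byte expert and 4-byte chunks, this yields three chunks whose transfers alternate between the two cold-link sets, so no single phase's links absorb the whole migration.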