MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high-overhead all-to-all communication, imbalanced communication load under mesh topologies, and costly expert migration in MoE inference on wafer-scale chips (WSCs), this paper proposes a co-designed mapping and scheduling framework. It introduces Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers so that their hot- and cold-link distributions are complementary, balancing communication pressure across the mesh. Building on this, the Non-Invasive Balancer (NI-Balancer) splits each expert migration into multiple steps that alternately use the cold links of both layers, hiding migration overhead off the critical path. Evaluation shows ER-Mapping reduces communication volume by up to 62%; NI-Balancer adds 54% and 22% improvements in MoE computation and communication, respectively; and, owing to its scalability to larger EP degrees, the WSC platform delivers an average 39% higher per-device MoE throughput than the SOTA NVL72 supernode.
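The sketch below is only a toy illustration of the complementary hot/cold-link property that ER-Mapping exploits, not the paper's actual construction: embedding one ring over a 4x4 mesh in row-major snake order and another in column-major snake order leaves the two rings sharing only a few physical links, so traffic from the two layer types lands mostly on disjoint wires. The mesh size and all function names are our assumptions.

```python
# Toy illustration (not the paper's ER-Mapping): two snake-order ring
# embeddings of one mesh stress mostly disjoint sets of physical links.

def row_snake_ring(rows, cols):
    """Boustrophedon device order: a common way to embed a ring in a mesh."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order += [(r, c) for c in cs]
    return order

def col_snake_ring(rows, cols):
    """The same traversal column-major, stressing the orthogonal links."""
    order = []
    for c in range(cols):
        rs = range(rows) if c % 2 == 0 else range(rows - 1, -1, -1)
        order += [(r, c) for r in rs]
    return order

def links_used(ring):
    """Physical mesh links that a ring's neighbor hops travel over (the final
    wrap-around hop needs multi-hop routing and is omitted for brevity)."""
    hops = zip(ring, ring[1:] + ring[:1])
    return {frozenset((a, b)) for a, b in hops
            if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1}

attn = links_used(row_snake_ring(4, 4))   # stand-in for attention traffic
moe = links_used(col_snake_ring(4, 4))    # stand-in for MoE traffic
print(f"shared links: {len(attn & moe)} of {len(attn)} per ring")
```

On a 4x4 mesh this reports 6 shared links out of 15 per ring, giving a feel for how two co-designed embeddings can keep each other's hot links mostly cold.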

📝 Abstract
As large language models (LLMs) continue to scale up, mixture-of-experts (MoE) has become a common technique in SOTA models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as a platform integrating numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, presenting promising potential for hosting MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disk leads to high-overhead expert migration on the critical path. To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide the migration overhead, we propose the Non-Invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping achieves a communication reduction of up to 62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC platform delivers an average 39% higher per-device MoE performance owing to its scalability to larger EP degrees.
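For context on why EP makes all-to-all unavoidable, here is a standard MoE dispatch sketch (textbook expert parallelism, not code from this paper): with experts sharded across devices and a uniform stand-in gate, almost every routed token has to leave its source device, and the combine phase returns each activation, roughly doubling the traffic. All sizes and the random gate are illustrative assumptions.

```python
# Why EP implies all-to-all: tokens are dispatched to the devices hosting
# their top-k experts, then combined back. Standard MoE, illustrative sizes.
import random

num_devices, experts_per_device, top_k = 8, 4, 2
tokens_per_device = 1024
num_experts = num_devices * experts_per_device

volume = [[0] * num_devices for _ in range(num_devices)]  # src -> dst tokens
random.seed(0)
for src in range(num_devices):
    for _ in range(tokens_per_device):
        for e in random.sample(range(num_experts), top_k):  # stand-in gate
            volume[src][e // experts_per_device] += 1        # dispatch hop

total = sum(map(sum, volume))
off_device = total - sum(volume[d][d] for d in range(num_devices))
# the combine phase sends every activation back, so real traffic is ~2x this
print(f"{off_device / total:.0%} of dispatched tokens cross device boundaries")
```

With uniform routing the off-device fraction approaches (num_devices - 1) / num_devices, about 88% here, which is why the interconnect topology dominates EP performance.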
Problem

Research questions and friction points this paper is trying to address.

Balancing communication pressure in mesh networks for wafer-scale chips
Reducing high-overhead all-to-all communication in expert parallelism
Hiding expert migration overhead without on-wafer disk storage
Innovation

Methods, ideas, or system contributions that make the work stand out.

ER-Mapping balances communication pressure via layer co-design
NI-Balancer splits expert migration into multiple steps over idle cold links (see the sketch after this list)
WSC platform enables scalable expert parallelism with unified network
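A hedged sketch of how that step-wise migration could look, based only on the abstract's description: one expert's weights move chunk by chunk, each chunk riding whichever layer phase currently leaves its links cold, and routing flips to the new replica only once the full copy has landed. The chunk size, phase names, and send_chunk callback are all assumptions for illustration.

```python
# Hedged sketch of phased expert migration in the spirit of NI-Balancer
# (reconstructed from the abstract, not the paper's implementation).

EXPERT_BYTES = 256 * 2**20   # one expert's weights (illustrative)
CHUNK_BYTES = 16 * 2**20     # migration granularity (assumed)

def migrate_expert(send_chunk, phases=("attention", "moe")):
    """Move an expert chunk by chunk, alternating over each phase's cold links."""
    moved, step = 0, 0
    while moved < EXPERT_BYTES:
        phase = phases[step % len(phases)]          # attention/MoE alternate
        size = min(CHUNK_BYTES, EXPERT_BYTES - moved)
        send_chunk(phase, moved, size)              # rides links cold in `phase`
        moved += size
        step += 1
    # routing flips to the new replica only here, after the full copy lands,
    # so the source replica keeps serving tokens throughout the migration

migrate_expert(lambda phase, offset, size:
               print(f"{phase:9s} cold links: bytes {offset}..{offset + size}"))
```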
Xinru Tang
University of California, Irvine
HCI, CSCW, Accessibility
Jingxiang Hou
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Dingcheng Jiang
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Taiquan Wei
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Jiaxin Liu
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Jinyi Deng
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Huizheng Wang
Tsinghua University
Sparse Attention, LLM accelerator, AI Infra, Distributed Parallelism, VLSI
Qize Yang
Tongyi Lab, Alibaba Group
Computer Vision, Deep Learning
Haoran Shang
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Chao Li
Shanghai Jiao Tong University, Shanghai, China
Yang Hu
Tsinghua University, School of Integrated Circuits, BNRist, Beijing, China
Shouyi Yin
Tsinghua University