🤖 AI Summary
To address the high overhead of all-to-all inter-node communication, imbalanced communication load under mesh topologies, and frequent expert migration in MoE inference on wafer-scale chips (WSCs), this paper proposes a co-designed mapping and scheduling optimization framework. It introduces Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication load across devices. Under ER-Mapping, the hot and cold links of the attention and MoE layers turn out to be complementary; the proposed Non-invasive Balancer (NI-Balancer) exploits this by splitting each expert migration into multiple phased steps that alternately use the cold links of both layers, hiding I/O latency off the critical path. In evaluation, the approach maintains full model accuracy while reducing total communication volume by up to 62% and improving MoE computation and communication efficiency by 54% and 22%, respectively; compared with an NVL72 super-node, it achieves a 39% improvement in per-device MoE inference throughput.
📝 Abstract
As large language models (LLMs) continue to scale, mixture-of-experts (MoE) has become a common technique in SOTA models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as platforms that integrate numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, making them a promising host for MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disks puts high-overhead expert migration on the critical path.
To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%, and NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication efficiency, respectively. Compared with the SOTA NVL72 super-node, the WSC platform delivers an average 39% higher per-device MoE performance owing to its scalability to larger EP degrees.
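The phased-migration idea behind NI-Balancer can be caricatured in a few lines: split one expert's weights into chunks and schedule consecutive chunks onto the links that are cold in alternating phases. This is a minimal illustrative sketch, not the paper's implementation; the chunk size, link names, and round-robin policy here are assumptions for illustration only.

```python
# Illustrative sketch of phased expert migration (NI-Balancer-style).
# All names and sizes below are hypothetical, not from the paper.

def plan_migration(expert_bytes: int, chunk_bytes: int):
    """Split one expert migration into (offset, size) chunks, each small
    enough to fit into a single idle-link window."""
    chunks, offset = [], 0
    while offset < expert_bytes:
        size = min(chunk_bytes, expert_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

def schedule(chunks):
    """Alternate chunks between links that are cold during the attention
    phase and links that are cold during the MoE phase, so migration
    traffic never rides a hot link."""
    phases = ("attention_cold_link", "moe_cold_link")
    return [(phases[i % 2], chunk) for i, chunk in enumerate(chunks)]

plan = schedule(plan_migration(expert_bytes=10, chunk_bytes=4))
```

With a 10-byte expert and 4-byte chunks, this yields three chunks whose transfers alternate between the two cold-link sets, so no single phase's links absorb the whole migration.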