Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance bottleneck in multi-GPU Mixture-of-Experts (MoE) models caused by frequent and redundant inter-GPU communication under expert parallelism (EP). The authors propose DySHARP, a dynamic in-switch computing solution designed for MoE's dynamic, irregular communication patterns, which the existing NVLink SHARP (NVLS) cannot handle because it supports only static collectives with regular patterns. DySHARP combines two parts: dynamic multimem addressing, an ISA, architecture, and runtime co-design that extends NVLS to reduce redundant traffic, and token-centric kernel fusion, which fuses the dispatch-computation-combine pipeline to resolve the directional asymmetry of that traffic reduction. DySHARP achieves up to 1.79× speedup over the state-of-the-art solution.

📝 Abstract

Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) becomes a performance challenge. We observe substantial redundant inter-GPU data transfers in MoE that can potentially be addressed by in-switch computing. Unfortunately, the existing solution, NVLink SHARP (NVLS), supports only static collectives with regular patterns and cannot handle the dynamic communication with irregular patterns in MoE. To bridge this functionality gap, we propose DySHARP, an integral dynamic in-switch computing solution to accelerate MoE, encompassing both communication primitives and communication-aware scheduling: 1) Dynamic multimem addressing co-designs the ISA, architecture, and runtime as a dynamic extension to NVLS, reducing redundant traffic. However, the resulting traffic reduction is inherently asymmetric between the two directions, which prevents it from translating directly into speedup. 2) Token-centric kernel fusion deeply fuses the dispatch-computation-combine pipeline, resolving this asymmetry so that the traffic reduction becomes actual speedup. Compared with the state-of-the-art solution, DySHARP achieves up to 1.79× speedup.
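
To make the redundancy and the directional asymmetry concrete, the following is a minimal counting sketch, not the paper's method. It assumes a single switch connecting all GPUs, a baseline that sends one unicast copy of a token per remote destination GPU, and an in-switch multicast that lets the source send a single copy which the switch replicates. The function name `dispatch_link_traffic`, the routing table, and the GPU and expert counts are all hypothetical.

```python
def dispatch_link_traffic(routes, token_home, num_gpus, experts_per_gpu, multicast):
    """Count token copies crossing each GPU's uplink and downlink during dispatch.

    routes[t]     : expert ids selected for token t (hypothetical routing result)
    token_home[t] : GPU that holds token t before dispatch
    multicast     : if True, assume the switch replicates a single uploaded copy
    """
    up = [0] * num_gpus    # copies sent GPU -> switch
    down = [0] * num_gpus  # copies sent switch -> GPU
    for t, experts in enumerate(routes):
        src = token_home[t]
        # Distinct remote GPUs hosting at least one selected expert.
        dst_gpus = {e // experts_per_gpu for e in experts} - {src}
        if multicast:
            # Assumption: one copy leaves the source, the switch fans it out.
            up[src] += 1 if dst_gpus else 0
        else:
            # Assumption: baseline sends one unicast copy per destination GPU.
            up[src] += len(dst_gpus)
        for g in dst_gpus:
            down[g] += 1  # each destination still receives its own copy
    return up, down


# Toy setup: 4 GPUs, 2 experts per GPU, top-2 routing for 4 tokens.
routes = [[2, 4], [0, 6], [4, 5], [0, 2]]
token_home = [0, 0, 1, 2]

up_unicast, down_unicast = dispatch_link_traffic(routes, token_home, 4, 2, multicast=False)
up_mcast, down_mcast = dispatch_link_traffic(routes, token_home, 4, 2, multicast=True)

print("uplink   unicast vs multicast:", up_unicast, up_mcast)
print("downlink unicast vs multicast:", down_unicast, down_mcast)
```

Under these assumptions, multicast lowers only the uplink counts in dispatch while the downlink counts stay the same. By a mirrored argument, an in-switch reduction in combine would lower only the downlink counts. This is one plausible reading of why the traffic reduction is asymmetric between the two directions and does not become speedup on its own.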

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

inter-GPU communication

in-switch computing

dynamic communication

expert parallelism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic In-Switch Computing

Mixture-of-Experts

Multi-GPU Communication