🤖 AI Summary
This work addresses the challenge of efficiently overlapping communication and computation in Mixture-of-Experts (MoE) models on multi-GPU systems, where dynamic and irregular token-to-expert mappings create significant communication bottlenecks. The authors propose a hardware-software co-design that decouples data transfer from address management through a destination-agnostic communication paradigm and introduces lightweight hardware units within the GPU hub to transparently handle address allocation and dataflow scheduling. By employing a hardware-accelerated communication control plane that eliminates software intermediaries, this approach bridges the abstraction gap between MoE’s dynamic routing behavior and the static communication model of GPUs. Evaluated against state-of-the-art systems, the design achieves 1.40–3.08× speedup per layer and 1.21–1.98× end-to-end acceleration.
📝 Abstract
The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability is severely limited by inter-GPU communication bottlenecks in multi-GPU systems. Although overlapping communication with computation is a widely recognized optimization, deploying it effectively remains challenging in terms of both performance and programmability. In this work, we identify the root cause as a fundamental abstraction mismatch between MoE's dynamic, irregular token-to-expert mapping and the static, address-centric communication model of modern GPUs. This mismatch necessitates a complex software mediation phase that resolves addresses before any data can be transferred, which limits both performance and software flexibility. To resolve this, we propose MoE-Hub, a hardware-software co-design that introduces a destination-agnostic communication paradigm. MoE-Hub decouples data transmission from address management: producers send data immediately after routing using only a logical destination, while address allocation and data-flow orchestration are handled transparently by lightweight hardware in the GPU hub. By hardware-accelerating the entire communication control plane, MoE-Hub enables seamless and transparent overlap of communication with computation. Our evaluation shows that MoE-Hub achieves 1.40–3.08× per-layer and 1.21–1.98× end-to-end speedup over state-of-the-art systems.
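To make the "software mediation phase" concrete, the following is a minimal sketch of the kind of address resolution that an address-centric dispatch typically needs before any token can be sent. It is an illustration only, not MoE-Hub's mechanism and not code from any particular system; the function name `resolve_addresses` and the flat, expert-grouped receive buffer are assumptions made for the example.

```python
def resolve_addresses(expert_ids, num_experts):
    """Assign each token a slot in a receive buffer grouped by expert.

    Hypothetical helper for illustration: expert_ids[i] is the expert
    chosen by the router for token i.
    """
    # Step 1: count how many tokens each expert will receive.
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1

    # Step 2: exclusive prefix sum gives each expert's base offset.
    offsets = []
    running = 0
    for c in counts:
        offsets.append(running)
        running += c

    # Step 3: hand out concrete slots token by token.
    next_slot = list(offsets)
    slots = []
    for e in expert_ids:
        slots.append(next_slot[e])
        next_slot[e] += 1

    return counts, offsets, slots


# Five tokens routed to three experts.
counts, offsets, slots = resolve_addresses([2, 0, 2, 1, 0], num_experts=3)
print(counts)   # [2, 1, 2]
print(offsets)  # [0, 2, 3]
print(slots)    # [3, 0, 4, 2, 1]
```

In this sketch no slot can be assigned until every token's expert is known, because the base offsets depend on the full set of counts. That dependency is the sort of serialization the abstract attributes to address-centric communication, and which a logical (destination-agnostic) send is meant to avoid.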