Eliminating Hidden Serialization in Multi-Node Megakernel Communication

📅 2026-05-01
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the severe performance degradation, up to 10× on 8 nodes, that communication-bound Mixture-of-Experts models suffer in multi-node megakernel inference. The cause is implicit serialization in proxy-based RDMA communication, and the regression worsens with node count. The authors present Perseus, which keeps the megakernel design (a single persistent GPU kernel that fuses expert computation with fine-grained, GPU-initiated communication) and removes the serialization with two techniques. Decoupled signaling batches fences at per-destination granularity, cutting the fence count by 8×. NIC-side ordering replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3× end-to-end speedup, and on IBRC it matches or exceeds the IBGDA GPU-direct baseline by up to 1.2×.
πŸ“ Abstract
Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
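To make the fence-batching idea concrete, the following is a minimal Python sketch of a toy model, not the authors' implementation. The abstract does not describe Perseus's code or API, so every name here (`per_tile_signaling`, `decoupled_signaling`, `ordering_holds`) and every parameter (`NUM_DESTS = 4`, `TILES_PER_DEST = 8`) is a hypothetical chosen for illustration. The model only counts fences and checks that each completion signal is still ordered after its tile transfer. It does not model NIC pipeline drain cost or hardware fence flags.

```python
from collections import OrderedDict

# Toy model (assumption): an operation is a tuple whose first field is
# "put", "fence", or "signal". A fence to a destination orders all earlier
# puts to that destination before any later signal to it.

def per_tile_signaling(tiles):
    """Baseline: each tile transfer is followed by its own fence and signal."""
    ops = []
    for dest, tile_id in tiles:
        ops.append(("put", dest, tile_id))
        ops.append(("fence", dest))
        ops.append(("signal", dest, tile_id))
    return ops

def decoupled_signaling(tiles):
    """Batched: issue all puts, then one fence per destination, then signals."""
    ops = [("put", dest, tile_id) for dest, tile_id in tiles]
    by_dest = OrderedDict()
    for dest, tile_id in tiles:
        by_dest.setdefault(dest, []).append(tile_id)
    for dest, tile_ids in by_dest.items():
        ops.append(("fence", dest))
        ops.extend(("signal", dest, t) for t in tile_ids)
    return ops

def count_fences(ops):
    return sum(1 for op in ops if op[0] == "fence")

def ordering_holds(ops):
    """True if every signal follows a fence that covers its tile's put."""
    puts_seen, fenced = set(), set()
    for op in ops:
        if op[0] == "put":
            puts_seen.add((op[1], op[2]))
        elif op[0] == "fence":
            fenced |= {p for p in puts_seen if p[0] == op[1]}
        elif (op[1], op[2]) not in fenced:  # a signal without a covering fence
            return False
    return True

NUM_DESTS = 4        # hypothetical number of destination nodes
TILES_PER_DEST = 8   # hypothetical tiles sent to each destination per batch
tiles = [(d, t) for t in range(TILES_PER_DEST) for d in range(NUM_DESTS)]

baseline = per_tile_signaling(tiles)
batched = decoupled_signaling(tiles)
print(count_fences(baseline), count_fences(batched))  # prints: 32 4
```

In this model the fence reduction equals the number of tiles batched per destination, so the 8-fold ratio comes from the assumed `TILES_PER_DEST = 8` and is not a reproduction of the paper's measurement. The trade-off the sketch exposes is that batched signals are issued only after the whole batch of puts, which is why the abstract pairs fence batching with NIC-side ordering instead of relying on the proxy to stall.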
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
multi-node inference
RDMA
hidden serialization
megakernel
Innovation

Methods, ideas, or system contributions that make the work stand out.

Megakernel
Mixture-of-Experts
RDMA
Hidden Serialization
NIC-side Ordering