Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

All-to-all dispatch is the dominant bottleneck in Mixture-of-Experts (MoE) expert parallelism, yet existing mitigations rest on two untested assumptions: that routing imbalance can be corrected at the system layer, and that mock-token benchmarks faithfully represent production routing. This work introduces DODOCO, a cross-architecture observatory that instruments five MoE checkpoints spanning five sequence-mixer designs (MLA, MHA, GQA, Mamba-2, and GDN) across a 5 by 6 grid of data conditions and expert-parallel (EP) degrees from 4 to 32 ranks. Both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5%, so the imbalance originates in the model's own routing decisions rather than in expert placement. Mock tokens overestimate routing Gini by up to 2.35× and produce a batch-size scaling trend that disappears on real text. The study also finds that the architectures separate into two stable bands (data-resilient and persistently concentrated, with GQA in between) and argues that these bands, rather than EP degree or mock-data profiles, are the right workload input for AlltoAll-aware interconnect and dispatch design.

📝 Abstract

AlltoAll dispatch is the dominant bottleneck of MoE expert parallelism, and the interconnect community has responded with four families of mitigations: predictive sample placement, adaptive expert relayout, hierarchical collectives, and EP-aware topology. All four rest on two assumptions about the workload. The first is that routing imbalance is correctable by the system layer. The second is that the mock-token benchmarks evaluating them faithfully represent production routing. We introduce DODOCO to test both assumptions. We instrument five MoE checkpoints spanning five sequence-mixer designs (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under a 5 by 6 grid of data conditions plus a matched EP scan from 4 to 32 ranks on H100s; both assumptions fail. Scaling EP changes the per-expert max/mean token ratio by at most 5% within every architecture's measurable range: the straggler is intrinsic to the routing decision the model makes, not to how its experts land on ranks. Mock tokens overestimate routing Gini by up to a factor of 2.35 and fabricate a batch-size scaling trend that vanishes the moment real text replaces random IDs. A third pattern, unexpected, emerges from the same matrix: the five architectures cleave into two stable bands. MHA and Mamba-2 (data-resilient) drop to Gini 0.105 and 0.150 on wikitext. MLA and GDN (persistently concentrated) stay above 0.24 on every real-text condition and reach 0.29 to 0.38 on mock. GQA is the intermediate case. These bands, not the EP degree or the mock-data profile, are the right workload input to AlltoAll-aware interconnect and dispatch design.
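The two imbalance metrics named in the abstract, routing Gini and the per-expert max/mean token ratio, can be sketched as follows. This is a minimal illustration, not the paper's code: the abstract does not state which Gini formulation DODOCO uses, so the standard sorted-values form is assumed, and the function names (`expert_counts`, `gini`, `max_mean_ratio`) and the flat list of routed expert IDs are hypothetical.

```python
from typing import List, Sequence


def expert_counts(expert_ids: Sequence[int], num_experts: int) -> List[int]:
    """Count how many routed tokens each expert received.

    `expert_ids` is assumed to be a flat list with one entry per
    (token, selected expert) pair; this layout is an assumption.
    """
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of per-expert token counts.

    0.0 means perfectly balanced routing; values approach 1.0 as
    tokens concentrate on fewer experts. Standard definition assumed.
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(counts)
    # Closed form on ascending values x_1..x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def max_mean_ratio(counts: Sequence[float]) -> float:
    """Ratio of the busiest expert's load to the mean load (straggler factor)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


if __name__ == "__main__":
    # Toy routing trace over 4 experts (illustrative values only).
    counts = expert_counts([0, 0, 1, 3, 0, 1], num_experts=4)
    print(counts, gini(counts), max_mean_ratio(counts))
```

Both metrics are computed from per-expert counts alone, so they describe the routing decision itself and do not depend on how experts are assigned to ranks.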

Problem

Research questions and friction points this paper is trying to address.

MoE

AlltoAll

routing imbalance

mock-token benchmark

expert parallelism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

AlltoAll communication

routing imbalance