🤖 AI Summary
Autoregressive decoding of Mixture-of-Experts (MoE) large language models is memory-bound even at moderate batch sizes, with per-step latency governed by the number of distinct experts the batch activates. To reduce this number, the paper proposes a training-free, dynamic token-to-expert rerouting mechanism. The core innovation is a batch-aware routing strategy: using real-time per-expert load information within the current batch, it opportunistically reroutes tokens to experts that other tokens in the batch have already activated, improving expert reuse and reducing memory-access overhead. This significantly decreases the number of distinct experts activated per decoding step without any model retraining. Evaluated on Qwen3-30B and Qwen3-235B at batch size 16, the method reduces MoE-layer decode latency by 39% and 15%, respectively, with no statistically significant degradation in generation quality.
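A back-of-the-envelope model (ours, not the paper's) illustrates why the distinct-expert count is the bottleneck. If each of $B$ tokens in a batch independently activated $k$ of $E$ experts uniformly at random, the batch would touch about

$$E\left(1 - \left(1 - \tfrac{k}{E}\right)^{B}\right)$$

distinct experts. For $E = 128$ and $k = 8$ (the Qwen3-30B MoE configuration) at $B = 16$, this is roughly $82$ distinct experts serving $16 \times 8 = 128$ activations, i.e., about $1.6$ tokens per loaded expert, far too few to amortize the cost of streaming each expert's weights from memory. Real routers are correlated rather than uniform, but the picture is the same: decode cost tracks how many experts must be loaded, not how many tokens are processed.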
📝 Abstract
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, in which the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results come from a batch-aware routing strategy in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE-layer decode latency by $39\%$ and $15\%$, respectively.
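The abstract does not spell out the exact rerouting rule, so the following is a minimal PyTorch sketch of one plausible instantiation. It keeps each token's top-1 expert as "crucial", then fills the remaining slots by preferring experts that other tokens in the batch have already activated, whenever their gating score is within a margin of the vanilla top-$k$ choice. The function name, the top-1-is-crucial heuristic, and the `margin` threshold are our assumptions, not the paper's method.

```python
import torch


def batch_aware_reroute(router_logits: torch.Tensor, top_k: int,
                        margin: float = 0.05) -> torch.Tensor:
    """Hypothetical sketch of batch-aware token-to-expert rerouting.

    router_logits: (B, E) gating logits for one MoE layer at one decode step.
    Returns (B, top_k) expert indices that favor reusing experts other
    tokens in the batch have already activated.
    """
    probs = torch.softmax(router_logits, dim=-1)          # (B, E)
    B, E = probs.shape
    device = router_logits.device

    # Every token keeps its top-1 expert: these are the "crucial" experts
    # that must be loaded into memory regardless of rerouting.
    top1 = probs.argmax(dim=-1)                           # (B,)
    active = torch.zeros(E, dtype=torch.bool, device=device)
    active[top1] = True

    chosen = torch.zeros(B, E, dtype=torch.bool, device=device)
    chosen[torch.arange(B, device=device), top1] = True   # slots already used
    routes = [top1]

    for _ in range(top_k - 1):
        masked = probs.masked_fill(chosen, float("-inf"))
        best_score, best_idx = masked.max(dim=-1)         # vanilla next pick
        # Best already-active expert this token has not used yet.
        reuse = masked.masked_fill(~active.unsqueeze(0), float("-inf"))
        reuse_score, reuse_idx = reuse.max(dim=-1)
        # Piggyback on a hot expert if it costs at most `margin` in score.
        take_reuse = reuse_score >= best_score - margin
        pick = torch.where(take_reuse, reuse_idx, best_idx)
        active[pick] = True                               # hot set grows
        chosen[torch.arange(B, device=device), pick] = True
        routes.append(pick)

    return torch.stack(routes, dim=-1)                    # (B, top_k)
```

In a real serving stack, `active` would determine which expert weights are gathered for the batched expert GEMMs, and the routing weights of rerouted tokens would be renormalized over the selected experts; both steps are omitted here for brevity.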