🤖 AI Summary
Autoregressive decoding of Mixture-of-Experts (MoE) large language models is memory-bound even at moderate batch sizes, with per-step latency governed by the number of distinct experts the batch activates. To reduce this number, the paper proposes a training-free, dynamic token-to-expert rerouting mechanism. The core innovation is a batch-aware routing strategy: using real-time per-expert load information within the current batch, it opportunistically reroutes tokens to experts that other tokens in the batch have already activated, improving expert reuse and reducing memory-access overhead. This significantly decreases the number of distinct experts activated per decoding step without any model retraining. Evaluated on Qwen3-30B and Qwen3-235B at batch size 16, the method reduces MoE-layer decode latency by 39% and 15%, respectively, with no statistically significant degradation in generation quality.
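A back-of-the-envelope model (ours, not the paper's) illustrates why the distinct-expert count is the bottleneck. If each of $B$ tokens in a batch independently activated $k$ of $E$ experts uniformly at random, the batch would touch about

$$E\left(1 - \left(1 - \tfrac{k}{E}\right)^{B}\right)$$

distinct experts. For $E = 128$ and $k = 8$ (the Qwen3-30B MoE configuration) at $B = 16$, this is roughly $82$ distinct experts serving $16 \times 8 = 128$ activations, i.e., about $1.6$ tokens per loaded expert, far too few to amortize the cost of streaming each expert's weights from memory. Real routers are correlated rather than uniform, but the picture is the same: decode cost tracks how many experts must be loaded, not how many tokens are processed.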
📝 Abstract
An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, in which the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results come from a batch-aware routing strategy in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens within the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B models with a batch size of $16$. Without any statistically significant loss in accuracy, our approach reduces MoE-layer decode latency by $39\%$ and $15\%$, respectively.
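The abstract does not spell out the exact rerouting rule, so the following is a minimal PyTorch sketch of one plausible instantiation. It keeps each token's top-1 expert as "crucial", then fills the remaining slots by preferring experts that other tokens in the batch have already activated, whenever their gating score is within a margin of the vanilla top-$k$ choice. The function name, the top-1-is-crucial heuristic, and the `margin` threshold are our assumptions, not the paper's method.

```python
import torch


def batch_aware_reroute(router_logits: torch.Tensor, top_k: int,
                        margin: float = 0.05) -> torch.Tensor:
    """Hypothetical sketch of batch-aware token-to-expert rerouting.

    router_logits: (B, E) gating logits for one MoE layer at one decode step.
    Returns (B, top_k) expert indices that favor reusing experts other
    tokens in the batch have already activated.
    """
    probs = torch.softmax(router_logits, dim=-1)          # (B, E)
    B, E = probs.shape
    device = router_logits.device

    # Every token keeps its top-1 expert: these are the "crucial" experts
    # that must be loaded into memory regardless of rerouting.
    top1 = probs.argmax(dim=-1)                           # (B,)
    active = torch.zeros(E, dtype=torch.bool, device=device)
    active[top1] = True

    chosen = torch.zeros(B, E, dtype=torch.bool, device=device)
    chosen[torch.arange(B, device=device), top1] = True   # slots already used
    routes = [top1]

    for _ in range(top_k - 1):
        masked = probs.masked_fill(chosen, float("-inf"))
        best_score, best_idx = masked.max(dim=-1)         # vanilla next pick
        # Best already-active expert this token has not used yet.
        reuse = masked.masked_fill(~active.unsqueeze(0), float("-inf"))
        reuse_score, reuse_idx = reuse.max(dim=-1)
        # Piggyback on a hot expert if it costs at most `margin` in score.
        take_reuse = reuse_score >= best_score - margin
        pick = torch.where(take_reuse, reuse_idx, best_idx)
        active[pick] = True                               # hot set grows
        chosen[torch.arange(B, device=device), pick] = True
        routes.append(pick)

    return torch.stack(routes, dim=-1)                    # (B, top_k)
```

In a real serving stack, `active` would determine which expert weights are gathered for the batched expert GEMMs, and the routing weights of rerouted tokens would be renormalized over the selected experts; both steps are omitted here for brevity.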