When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work examines whether the routes chosen by standard top-k routers in Mixture-of-Experts language models are actually good ones, a question that is rarely evaluated directly. With model parameters frozen, the authors compare each standard route against sampled counterfactual routes of equal compute cost, scoring every route by the next-token probability it assigns to the realized token along a verified reasoning trajectory. The router aligns well with route utility on high-confidence tokens but is uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes exist inside the frozen model but are not selected. The pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B. The authors trace it to the training objective: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. Updating only the final-layer router, with every expert and every other router frozen, is enough to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B.
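The point about load balancing can be made concrete with a small sketch. It assumes a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability); the summary does not say which balancing loss the studied models use, so the formula, the function name `load_balance_loss`, and the toy probabilities are all illustrative assumptions. Because the loss is built from batch averages, handing the same routing rows to different tokens leaves it unchanged.

```python
def load_balance_loss(router_probs, k):
    """Assumed Switch-style balancing loss: n_experts * sum_i f_i * p_i.

    f_i is the fraction of dispatch slots sent to expert i under top-k routing,
    p_i is the mean router probability of expert i over the batch.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    for probs in router_probs:
        top_k = sorted(range(n_experts), key=lambda e: -probs[e])[:k]
        for e in top_k:
            counts[e] += 1
    f = [c / (n_tokens * k) for c in counts]
    p = [sum(row[e] for row in router_probs) / n_tokens for e in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))


# Hypothetical router probabilities for three tokens over three experts.
batch = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]
# The same rows assigned to different token positions.
reassigned = [batch[2], batch[0], batch[1]]

original_loss = load_balance_loss(batch, k=1)
reassigned_loss = load_balance_loss(reassigned, k=1)
```

Under this assumed loss, `original_loss` and `reassigned_loss` agree up to floating-point rounding, so the balancing term alone cannot tell whether a particular fragile token received a useful route.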

📝 Abstract

Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
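The route comparison described above can be sketched on a toy single-layer model. Everything here is an illustrative assumption rather than the paper's implementation: each expert is reduced to a fixed vector of vocabulary logits, gates are a softmax over the selected experts' router scores, and alternatives are enumerated exhaustively instead of sampled because the toy has only four experts. The names `route_log_prob` and `compare_routes` are hypothetical.

```python
import itertools
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_log_prob(route, router_scores, expert_logits, target):
    """Log-probability of the realized token when only `route`'s experts run."""
    gates = softmax([router_scores[e] for e in route])
    vocab = len(expert_logits[0])
    mixed = [
        sum(g * expert_logits[e][v] for g, e in zip(gates, route))
        for v in range(vocab)
    ]
    return math.log(softmax(mixed)[target])


def compare_routes(router_scores, expert_logits, target, k):
    """Score the standard top-k route against every other k-expert route."""
    n = len(router_scores)
    ranked = sorted(range(n), key=lambda e: -router_scores[e])
    standard = tuple(sorted(ranked[:k]))
    scores = {
        route: route_log_prob(route, router_scores, expert_logits, target)
        for route in itertools.combinations(range(n), k)  # all use k experts
    }
    best = max(scores, key=scores.get)
    return standard, scores[standard], best, scores[best]


# Toy token: the router prefers experts 0 and 1, but experts 2 and 3
# are the ones that put mass on the realized token (index 2).
router_scores = [2.0, 1.0, 0.5, -1.0]
expert_logits = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
]
standard, standard_lp, best, best_lp = compare_routes(
    router_scores, expert_logits, target=2, k=2
)
```

In this constructed example the standard route is `(0, 1)` while the highest-scoring equal-compute route is `(2, 3)`, which mirrors the kind of gap the abstract reports on fragile tokens. With a real model, the model weights would stay frozen and the score would come from a full forward pass with the chosen route substituted in.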

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

routing

token routing

language models

expert allocation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

counterfactual routing

router analysis