🤖 AI Summary
This work addresses the challenge of balancing generation diversity and output stability in fine-grained Mixture-of-Experts (MoE) models during inference. It reveals, for the first time, that routing scores exhibit a “deterministic head–uncertain tail” structure: high-confidence experts primarily govern reasoning capability, while low-confidence experts are linked to output diversity. Building on this insight, the authors propose Expert-Sample, a training-free method that preserves the deterministic head while introducing controlled randomness into the uncertain tail, effectively decoupling stability from diversity. Evaluated on models such as Qwen3-30B-A3B-Instruct, the approach improves pass@32 accuracy on the GPQA-Diamond benchmark from 85.4% to 91.9% and boosts Best-of-N accuracy from 59.1% to 62.6%.
📝 Abstract
Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
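The head-preserving, tail-sampling idea can be sketched in a few lines. This is not the paper's implementation; the function name, split sizes (`head`, `tail_pool`), and temperature are illustrative assumptions. The sketch keeps the top-confidence experts deterministically and fills the remaining activation slots by sampling from the next-ranked candidates in proportion to their router scores:

```python
import numpy as np

def expert_sample(router_logits, k=8, head=4, tail_pool=16, temp=1.0, rng=None):
    """Illustrative head-preserving, tail-sampling expert selection.

    Keeps the `head` highest-confidence experts deterministically (the
    "certain head"), then fills the remaining k - head slots by sampling
    without replacement from the next `tail_pool` ranked candidates (the
    "uncertain tail"), with probabilities from their softmax scores.
    All names and default values are assumptions for illustration.
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(router_logits)[::-1]      # experts sorted by confidence
    head_ids = order[:head]                      # certain head: always kept
    pool = order[head:head + tail_pool]          # uncertain-tail candidates
    scores = router_logits[pool] / temp
    probs = np.exp(scores - scores.max())        # numerically stable softmax
    probs /= probs.sum()
    tail_ids = rng.choice(pool, size=k - head, replace=False, p=probs)
    return np.concatenate([head_ids, tail_ids])

# Per-token usage: select 8 of 64 experts, with the top 4 fixed
logits = np.random.default_rng(0).normal(size=64)
selected = expert_sample(logits, k=8, head=4, rng=np.random.default_rng(1))
```

Because the head is fixed, repeated calls vary only in which tail experts activate, which is the mechanism the abstract credits for adding diversity without destabilizing the output.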