Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing generation diversity and output stability in fine-grained Mixture-of-Experts (MoE) models during inference. It reveals, for the first time, that routing scores exhibit a “deterministic head–uncertain tail” structure: high-confidence experts primarily govern reasoning capability, while low-confidence experts are linked to output diversity. Building on this insight, the authors propose Expert-Sample, a training-free method that preserves the deterministic head while introducing controlled randomness into the uncertain tail, effectively decoupling stability from diversity. Evaluated on models such as Qwen3-30B-A3B-Instruct, the approach improves pass@32 accuracy on the GPQA-Diamond benchmark from 85.4% to 91.9% and boosts Best-of-N accuracy from 59.1% to 62.6%.

📝 Abstract
Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
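The abstract describes the mechanism only in prose; below is a minimal, hypothetical PyTorch sketch of what head-preserving, tail-sampled routing could look like. The function name `expert_sample_routing`, the head/tail split size, the tail candidate pool, and the temperature `tau` are illustrative assumptions, not details taken from the paper.

```python
import torch

def expert_sample_routing(router_logits: torch.Tensor,
                          k: int = 8,
                          head: int = 6,
                          tail_pool: int = 16,
                          tau: float = 1.0):
    """Hypothetical sketch: keep the high-confidence head of experts
    deterministically, then sample the remaining slots from the
    uncertain tail of router candidates.

    router_logits: [num_tokens, num_experts] raw router scores.
    All hyperparameter values here are illustrative, not from the paper.
    """
    probs = torch.softmax(router_logits, dim=-1)
    # Rank experts by router confidence for each token.
    ranked = probs.argsort(dim=-1, descending=True)

    # Deterministic head: always activate the top-`head` experts,
    # preserving the selections linked to core reasoning capability.
    head_idx = ranked[:, :head]

    # Uncertain tail: fill the remaining k - head slots by sampling
    # without replacement from the next `tail_pool` candidates, with a
    # temperature controlling how flat the tail distribution is.
    tail_candidates = ranked[:, head:head + tail_pool]
    tail_logits = torch.gather(router_logits, -1, tail_candidates) / tau
    picks = torch.multinomial(torch.softmax(tail_logits, dim=-1),
                              num_samples=k - head)
    tail_idx = torch.gather(tail_candidates, -1, picks)

    # Combine head and tail, renormalizing weights over chosen experts.
    chosen = torch.cat([head_idx, tail_idx], dim=-1)
    weights = torch.gather(probs, -1, chosen)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return chosen, weights
```

In a real MoE layer, a routine like this would replace the standard deterministic top-k selection inside the router: the head size controls output stability, while the tail pool size and temperature control how much diversity is injected, matching the decoupling the abstract describes.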
Problem

Research questions and friction points this paper is trying to address.

test-time scaling
fine-grained MoE
reasoning diversity
expert routing
pass@n
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-Sample
Fine-grained MoE
Test-time scaling
Routing diversity
Pass@n improvement
Yuanteng Chen
Institute of Automation, Chinese Academy of Sciences; Zhongguancun Academy; School of Artificial Intelligence, University of Chinese Academy of Sciences
Peisong Wang
CASIA
Deep Neural Network Acceleration and Compression
Nanxin Zeng
School of Artificial Intelligence, University of Chinese Academy of Sciences
Yuantian Shao
Nanjing University of Science and Technology; Institute of Automation, Chinese Academy of Sciences
Gang Li
Institute of Automation, Chinese Academy of Sciences
Computer Architecture, Machine Learning
Jing Liu
Institute of Theoretical Physics, Chinese Academy of Sciences
Statistical Physics, Machine Learning
Jian Cheng
Institute of Automation, Chinese Academy of Sciences