MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

📅 2025-09-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To improve the per-token prediction quality of large language models (LLMs) at inference time, without any parameter fine-tuning, this paper proposes RoE, a hyper-parallel inference framework. RoE repurposes the Mixture-of-Experts (MoE) architecture at inference time as a training-free ensemble: it generates diverse expert proposals via controlled stochastic routing and aggregates their outputs into a more accurate final prediction. Combined with efficient batched execution and a specialized KV-cache mechanism that keeps compute and memory overhead low, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less inference compute, with zero parameter updates throughout deployment.

📝 Abstract
The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
Problem

Research questions and friction points this paper is trying to address.

Improving token-level prediction quality in large language models
Reducing computational costs of multiple expert proposals in MoEs
Enhancing MoE model performance without fine-tuning parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hyper-parallel scaling aggregates multiple token proposals
RoE algorithm creates dynamic ensemble via stochastic routing
Efficient batching and KV-caching reduce computational overhead
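The stochastic-routing and multi-proposal aggregation ideas above can be sketched in a few lines. This is a hypothetical illustration of the general technique, not the paper's implementation: the toy weights, the `gumbel_topk_route` helper, and the temperature parameter are all assumptions made for the sketch. Router logits are perturbed with Gumbel noise to sample diverse top-k expert subsets, each subset yields one token-distribution "proposal", and the proposals are averaged into the final prediction.

```python
# Toy sketch of noisy MoE routing with multi-proposal aggregation (RoE-style).
# All weights are random stand-ins; no claim is made about the authors' code.
import numpy as np

rng = np.random.default_rng(0)
NUM_EXPERTS, TOP_K, HIDDEN, VOCAB = 8, 2, 4, 16

# Stand-ins for trained parameters: per-expert maps to vocab logits, plus a router.
expert_heads = rng.normal(size=(NUM_EXPERTS, HIDDEN, VOCAB))
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))

def gumbel_topk_route(h, temperature):
    """Sample a top-k expert subset by perturbing router logits with Gumbel noise."""
    logits = h @ router_w
    noise = temperature * rng.gumbel(size=logits.shape)
    return np.argsort(-(logits + noise))[:TOP_K]

def proposal(h, temperature):
    """One stochastic proposal: route, run the chosen experts, softmax the mean."""
    experts = gumbel_topk_route(h, temperature)
    logits = np.mean([h @ expert_heads[e] for e in experts], axis=0)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # probability distribution over the vocabulary

def roe_predict(h, n_proposals=4, temperature=0.5):
    """Aggregate several proposals into a single sharper token distribution."""
    probs = np.mean([proposal(h, temperature) for _ in range(n_proposals)], axis=0)
    return int(np.argmax(probs)), probs

h = rng.normal(size=HIDDEN)  # hidden state for one token
tok, probs = roe_predict(h)
```

In the paper the proposals are executed as one batched forward pass with a shared KV cache rather than as a Python loop; the loop here only makes the aggregation logic explicit.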