From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
In MoE model inference, token routing causes expert load imbalance, degrading latency, throughput, and hardware utilization. This paper proposes LASER—a plug-and-play, training-free inference-time routing algorithm. Its core innovation is dynamic modeling of the gating score distribution: under score concentration, it prioritizes high-confidence experts; under score dispersion, it proactively schedules underloaded experts—enabling load-aware conditional routing. LASER integrates seamlessly into existing MoE inference pipelines and requires only the original gating outputs—no architectural or training modifications. Evaluated across multiple MoE models and benchmarks, LASER achieves substantial improvements: +28% average throughput, −22% average latency, and up to 47% reduction in expert load standard deviation, with negligible accuracy degradation (<0.1%).

📝 Abstract
Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden to inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8x7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, translating into lower latency and higher throughput, while keeping the accuracy changes negligible.
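The adaptive rule the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's actual algorithm: the entropy-based concentration measure, the threshold `tau`, the widened candidate-set size, and the function name are all assumptions introduced here for clarity.

```python
import numpy as np

def laser_route(gate_scores, expert_loads, k=2, tau=0.5):
    """Illustrative load-aware top-k routing for one token.

    gate_scores: gate outputs over experts, shape (num_experts,)
    expert_loads: current token count assigned to each expert
    tau: hypothetical concentration threshold (assumption, not from the paper)
    """
    # Normalize scores and measure their concentration via normalized entropy.
    p = gate_scores / gate_scores.sum()
    entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(p))

    if entropy < tau:
        # Scores show a clear preference: route to the k strongest experts.
        return np.argsort(p)[-k:][::-1]

    # Scores are dispersed: widen the viable set (here to 2k, an assumption)
    # and pick the least-loaded experts among those candidates.
    m = min(2 * k, len(p))
    candidates = np.argsort(p)[-m:]
    order = np.argsort(expert_loads[candidates])
    return candidates[order][:k]
```

Under concentrated scores the sketch behaves like standard top-k gating, so accuracy-relevant tokens keep their preferred experts; only low-confidence tokens are redirected toward underloaded experts, which is where the load-balancing benefit comes from.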
Problem

Research questions and friction points this paper is trying to address.

Balancing expert load in MoE models
Reducing inference memory and latency
Maintaining accuracy without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play routing algorithm balances expert load
Adapts routing based on gate score distribution shape
Integrates without retraining using existing gate scores