SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck that Mixture-of-Experts (MoE) models face during batched inference, where excessive expert activation slows the memory-bound decoding stage. The authors propose SERE, an input-aware dynamic expert-skipping mechanism that uses similarity analysis to identify redundant experts and re-route their tokens, adaptively reducing the number of activated experts without the performance degradation associated with static pruning. Backed by custom CUDA kernels, SERE integrates into vLLM with a single-line code change. Experiments across multiple complex reasoning benchmarks demonstrate up to 2.0× speedup with negligible quality loss, improving the deployment efficiency and practicality of MoE models.

📝 Abstract
Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. Code implementation of SERE can be found at https://github.com/JL-Cheng/SERE.
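The core idea in the abstract — collapsing a token's set of active experts by re-routing each "secondary" expert's gate weight to its most similar "primary" expert — can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `sere_reroute`, the use of per-expert embedding vectors as the similarity signal, and the `keep_m` parameter are all assumptions for the sake of the example; SERE itself operates at batch level with custom CUDA kernels.

```python
import numpy as np

def sere_reroute(gate_probs, expert_embs, keep_m):
    """Hypothetical sketch of similarity-based expert re-routing.

    gate_probs:  (num_experts,) routing weights for one token, nonzero
                 only on the originally selected top-k experts.
    expert_embs: (num_experts, d) per-expert feature vectors used to
                 measure similarity (a stand-in; the paper's actual
                 similarity signal may differ).
    keep_m:      number of "primary" experts actually executed.

    Returns gate weights whose mass lies on at most keep_m experts,
    with each dropped expert's weight re-routed to its most similar
    primary expert (cosine similarity).
    """
    # Primary experts: the keep_m highest-weight experts for this token.
    primary = np.argsort(gate_probs)[::-1][:keep_m]
    new_probs = np.zeros_like(gate_probs)
    new_probs[primary] = gate_probs[primary]

    # Cosine similarity between every expert and each primary expert.
    norm = expert_embs / np.linalg.norm(expert_embs, axis=1, keepdims=True)
    sim = norm @ norm[primary].T  # shape: (num_experts, keep_m)

    # Re-route each secondary expert's weight to its closest primary.
    secondary = [e for e in np.nonzero(gate_probs)[0] if e not in primary]
    for e in secondary:
        target = primary[np.argmax(sim[e])]
        new_probs[target] += gate_probs[e]
    return new_probs
```

Because the re-routed weights are added rather than discarded, the total gate mass for the token is preserved while the number of experts that must be loaded into memory drops to `keep_m` — the property that makes the memory-bound decoding stage faster in batched serving.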
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
batch decoding
expert activation
memory-bound decoding
hardware efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert re-routing
batch decoding
dynamic expert skipping
similarity-based routing