SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the memory bottleneck that Mixture-of-Experts (MoE) models face during batched inference, where excessive expert activation slows the memory-bound decoding stage. The authors propose SERE, an input-aware dynamic expert-skipping mechanism that uses similarity analysis to identify redundant experts and re-route their tokens, adaptively reducing the number of activated experts without the performance degradation associated with static pruning. Backed by custom CUDA kernels, SERE integrates into vLLM with a single-line code change. Experiments across multiple complex reasoning benchmarks demonstrate up to 2.0× speedup with negligible quality loss, improving the deployment efficiency and practicality of MoE models.

📝 Abstract
Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. Code implementation of SERE can be found at https://github.com/JL-Cheng/SERE.
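The core idea in the abstract — collapsing a token's set of active experts by re-routing each "secondary" expert's gate weight to its most similar "primary" expert — can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `sere_reroute`, the use of per-expert embedding vectors as the similarity signal, and the `keep_m` parameter are all assumptions for the sake of the example; SERE itself operates at batch level with custom CUDA kernels.

```python
import numpy as np

def sere_reroute(gate_probs, expert_embs, keep_m):
    """Hypothetical sketch of similarity-based expert re-routing.

    gate_probs:  (num_experts,) routing weights for one token, nonzero
                 only on the originally selected top-k experts.
    expert_embs: (num_experts, d) per-expert feature vectors used to
                 measure similarity (a stand-in; the paper's actual
                 similarity signal may differ).
    keep_m:      number of "primary" experts actually executed.

    Returns gate weights whose mass lies on at most keep_m experts,
    with each dropped expert's weight re-routed to its most similar
    primary expert (cosine similarity).
    """
    # Primary experts: the keep_m highest-weight experts for this token.
    primary = np.argsort(gate_probs)[::-1][:keep_m]
    new_probs = np.zeros_like(gate_probs)
    new_probs[primary] = gate_probs[primary]

    # Cosine similarity between every expert and each primary expert.
    norm = expert_embs / np.linalg.norm(expert_embs, axis=1, keepdims=True)
    sim = norm @ norm[primary].T  # shape: (num_experts, keep_m)

    # Re-route each secondary expert's weight to its closest primary.
    secondary = [e for e in np.nonzero(gate_probs)[0] if e not in primary]
    for e in secondary:
        target = primary[np.argmax(sim[e])]
        new_probs[target] += gate_probs[e]
    return new_probs
```

Because the re-routed weights are added rather than discarded, the total gate mass for the token is preserved while the number of experts that must be loaded into memory drops to `keep_m` — the property that makes the memory-bound decoding stage faster in batched serving.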
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
batch decoding
expert activation
memory-bound decoding
hardware efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
expert re-routing
batch decoding
dynamic expert skipping
similarity-based routing