BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

📅 2026-05-14

🤖 AI Summary
This work addresses the inefficiency of standard Mixture-of-Experts (MoE) models, whose fixed Top-K routing causes redundant computation and high inference latency. Existing acceleration approaches either require retraining with architectural changes or degrade sharply at high sparsity because of train-inference mismatch. The authors propose BEAM, a novel method that learns a binary expert activation mask so that the set of active experts adapts to each token. BEAM is plug-and-play, requires no architectural modifications, and mitigates the train-inference discrepancy. It is trained end to end with a straight-through estimator and an auxiliary regularization loss, and a custom CUDA kernel integrates it with the vLLM inference framework. BEAM retains over 98% of the original model's performance while reducing MoE-layer FLOPs by up to 85%, with up to 2.5× faster decoding and 1.4× higher throughput.
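To make the FLOPs figure concrete, the following is a rough back-of-the-envelope sketch, not the paper's accounting. It assumes that expert FLOPs in an MoE layer scale linearly with the number of experts actually run per token and that router and mask overhead is negligible. The function names and the Top-K value of 8 are hypothetical and chosen only for illustration.

```python
def moe_flops_reduction(top_k: int, avg_active: float) -> float:
    """Fraction of expert FLOPs saved when, on average, only `avg_active`
    of the `top_k` routed experts are executed per token.

    Assumption: every expert costs the same, and routing cost is ignored.
    """
    if top_k <= 0 or not 0 <= avg_active <= top_k:
        raise ValueError("need top_k > 0 and 0 <= avg_active <= top_k")
    return 1.0 - avg_active / top_k


def avg_active_for_reduction(top_k: int, reduction: float) -> float:
    """Inverse of the function above: average number of active experts
    per token that corresponds to a given FLOPs reduction."""
    if top_k <= 0 or not 0.0 <= reduction <= 1.0:
        raise ValueError("need top_k > 0 and 0 <= reduction <= 1")
    return top_k * (1.0 - reduction)


if __name__ == "__main__":
    # Hypothetical configuration: a model that routes each token to 8 experts.
    print(round(avg_active_for_reduction(8, 0.85), 2))
    print(round(moe_flops_reduction(8, 1.2), 2))
```

Under these assumptions, an 85% reduction with a hypothetical Top-K of 8 would correspond to running about 1.2 experts per token on average.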
📝 Abstract
Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from a severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
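The mechanism described in the abstract can be sketched in plain Python, as an illustration rather than the authors' implementation. The sketch assumes a per-token score for each expert (`mask_logits`, a hypothetical name) that is thresholded into a binary mask on top of ordinary Top-K routing. The forward pass uses the hard 0/1 mask, while the backward pass applies a straight-through estimator that treats the threshold as the identity. The fallback that always keeps the top-1 expert, the unnormalized weights, and the mean-activation penalty are assumptions; the abstract does not specify these details.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def masked_route(router_logits, mask_logits, k, threshold=0.5):
    """Top-K routing followed by a binary activation mask (forward pass).

    Returns (weights, soft, hard): `weights` maps each kept expert index to
    its routing weight; `soft` and `hard` are the mask values for the K
    selected experts, ordered by descending router logit.
    """
    selected = sorted(range(len(router_logits)),
                      key=lambda i: router_logits[i], reverse=True)[:k]
    probs = softmax([router_logits[i] for i in selected])
    soft = [sigmoid(mask_logits[i]) for i in selected]
    hard = [1.0 if s > threshold else 0.0 for s in soft]
    if not any(hard):
        hard[0] = 1.0  # assumption: never leave a token with zero experts
    # Assumption: weights of kept experts are not renormalized.
    weights = {i: p for i, p, h in zip(selected, probs, hard) if h}
    return weights, soft, hard


def ste_mask_grad(upstream_grad, soft):
    """Straight-through backward pass: pretend d(hard)/d(soft) = 1,
    then chain through the sigmoid to get gradients w.r.t. mask logits."""
    return [g * s * (1.0 - s) for g, s in zip(upstream_grad, soft)]


def sparsity_penalty(soft):
    """Hypothetical auxiliary regularizer: mean soft activation, so that
    minimizing it pushes the mask toward fewer active experts."""
    return sum(soft) / len(soft)


if __name__ == "__main__":
    weights, soft, hard = masked_route(
        router_logits=[2.0, 1.0, 0.5, -1.0],
        mask_logits=[3.0, -2.0, 0.1, 0.0],
        k=3,
    )
    print(sorted(weights), hard)
    print(ste_mask_grad([1.0, 1.0, 1.0], soft), sparsity_penalty(soft))
```

In this toy run, experts 0, 1, and 2 are selected by Top-K, and the mask then drops expert 1, so only two of the three routed experts would be executed for this token. In an autograd framework the same trick is usually written as a custom backward rule rather than by hand.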
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
dynamic routing
inference latency
expert sparsity
train-inference mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary Expert Activation Masking
Dynamic Routing
Mixture-of-Experts
Token-adaptive Sparsity
Efficient Inference