Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference

📅 2025-09-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address two key challenges in deploying sparse Mixture-of-Experts (SMoE) models for online inference on edge devices, namely deployment inefficiency and inaccurate expert routing without task labels, this paper proposes a task-aware expert merging framework. Without requiring explicit task annotations, the framework dynamically estimates implicit task distributions from historical queries and introduces a Tree-structured Adaptive Neural Bandit Router (Tanbr), which progressively partitions the continuous merging-weight space to generate fusion weights for dynamic expert merging. To preserve model accuracy, the framework learns a nonlinear mapping from merging weights to model performance and transfers knowledge from the pre-trained MoE. Theoretical analysis establishes a sublinear regret bound. Experiments demonstrate that, compared with state-of-the-art methods, Tanbr reduces inference latency by at least 45%, decreases memory footprint by up to 25%, and maintains the original accuracy.

📝 Abstract
Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for *online inference* remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during online inference, task information is often unavailable, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, `Tanbr`, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, `Tanbr` estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, `Tanbr` employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides the optimal expert merging. We prove that `Tanbr` achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T}\log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that `Tanbr` reduces inference latency by at least 45% and memory usage by up to 25%, while maintaining high accuracy compared to many state-of-the-art methods.
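As a rough illustration of the dynamic expert merging described above, the merged expert can be viewed as a convex combination of the pre-trained experts' parameters under fusion weights. This is a minimal sketch under that assumption; the function names and shapes are illustrative and not taken from the paper's code:

```python
import numpy as np

def merge_experts(expert_weights, fusion_weights):
    """Merge per-expert parameter matrices into a single expert via a
    convex combination of their weights (illustrative sketch only)."""
    fusion = np.asarray(fusion_weights, dtype=float)
    fusion = fusion / fusion.sum()          # normalize to a convex combination
    stacked = np.stack(expert_weights)      # shape: (num_experts, d_out, d_in)
    return np.tensordot(fusion, stacked, axes=1)  # weighted sum over experts

# Toy example: three 2x2 "expert" matrices merged with weights [0.5, 0.3, 0.2]
experts = [np.eye(2), np.ones((2, 2)), np.zeros((2, 2))]
merged = merge_experts(experts, [0.5, 0.3, 0.2])
```

Merging experts into a single dense layer before inference is what removes the per-token routing cost and shrinks the deployed model, which is the source of the latency and memory savings the abstract reports.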
Problem

Research questions and friction points this paper is trying to address.

Deploying large SMoE models for online inference in resource-constrained edge networks
Handling expert routing complexity when task information is unavailable during inference
Optimizing expert merging to reduce latency and memory while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tree-structured neural bandit router for expert merging
Task distribution estimation without explicit task tags
Binary tree partitioning for continuous weight space optimization
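The binary-tree partitioning idea in the bullets above can be sketched as follows. This is a simplified illustration that assumes an axis-aligned box over the merging-weight space, with each node's centre serving as a candidate weight; `TreeNode` and the split-widest-dimension rule are assumptions for exposition, not the paper's implementation:

```python
import numpy as np

class TreeNode:
    """Node of a binary partition tree over a box of candidate merging weights."""
    def __init__(self, lo, hi):
        self.lo = np.asarray(lo, dtype=float)
        self.hi = np.asarray(hi, dtype=float)
        self.left = self.right = None

    def candidate(self):
        # The box centre acts as this node's candidate merging weight.
        return (self.lo + self.hi) / 2.0

    def split(self):
        # Halve the widest dimension, yielding two finer child regions.
        dim = int(np.argmax(self.hi - self.lo))
        mid = (self.lo[dim] + self.hi[dim]) / 2.0
        left_hi, right_lo = self.hi.copy(), self.lo.copy()
        left_hi[dim], right_lo[dim] = mid, mid
        self.left = TreeNode(self.lo, left_hi)
        self.right = TreeNode(right_lo, self.hi)
        return self.left, self.right

root = TreeNode([0.0, 0.0], [1.0, 1.0])   # weight space for two experts
left, right = root.split()                 # progressively finer candidates
```

Repeatedly splitting promising regions lets a bandit search a continuous weight space with only finitely many candidates at any time, which is what makes a sublinear regret bound attainable despite the continuous decision space.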
Ziyi Han
The Chinese University of Hong Kong, Hong Kong
Xutong Liu
Assistant Professor of Computer Science and Systems, University of Washington
Reinforcement Learning · Online Learning · Combinatorial Optimization · Network Systems
Ruiting Zhou
Southeast University, Nanjing, China
Xiangxiang Dai
The Chinese University of Hong Kong
Online Learning · Bandits · Reinforcement Learning
John C. S. Lui
The Chinese University of Hong Kong, Hong Kong