🤖 AI Summary
This paper targets two key challenges in deploying sparse Mixture-of-Experts (SMoE) models for online inference on edge devices: deployment inefficiency and inaccurate expert routing when task labels are unavailable. It proposes a task-aware expert-merging framework that, without explicit task annotations, dynamically estimates the implicit task distribution from historical queries, and introduces a Tree-structured Adaptive Neural Bandit Router (`Tanbr`) that progressively partitions the continuous merging-weight space to generate fusion weights for dynamic expert merging. To preserve accuracy, the framework learns a nonlinear mapping from merging weights to model performance and transfers knowledge from the pre-trained MoE. Theoretical analysis establishes a sublinear regret bound. Experiments show that, compared with state-of-the-art methods, the approach reduces inference latency by at least 45%, decreases memory footprint by up to 25%, and maintains the original accuracy.
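The dynamic expert merging described above can be pictured as a convex combination of per-expert parameters, so that only one merged expert needs to be deployed. Below is a minimal sketch under that interpretation; the names (`merge_experts`, `fusion_weights`) are illustrative, not the paper's implementation:

```python
def merge_experts(expert_params, fusion_weights):
    """Merge per-expert parameter tensors into one dense expert.

    expert_params: list of dicts, parameter name -> flat list of floats
    fusion_weights: one non-negative weight per expert
    """
    total = sum(fusion_weights)
    w = [x / total for x in fusion_weights]  # normalize to a convex combination
    merged = {}
    for name in expert_params[0]:
        size = len(expert_params[0][name])
        merged[name] = [
            sum(wi * p[name][i] for wi, p in zip(w, expert_params))
            for i in range(size)
        ]
    return merged

# toy example: two experts, equal fusion weights -> elementwise average
experts = [{"W": [1.0, 1.0]}, {"W": [3.0, 3.0]}]
print(merge_experts(experts, [0.5, 0.5]))  # {'W': [2.0, 2.0]}
```

Serving only the merged expert is what yields the latency and memory savings the summary reports: the full expert bank and per-token routing are no longer needed at inference time.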
📝 Abstract
Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without a proportional increase in computational cost, as it activates only a small subset of experts for each input. However, deploying such models for *online inference* remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, task information is often unavailable during online inference, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, `Tanbr`, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, `Tanbr` estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, `Tanbr` employs a binary tree to progressively partition the space and generate finer-grained candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and to decide the optimal expert merging. We prove that `Tanbr` achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T}\log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that `Tanbr` reduces inference latency by at least 45% and memory usage by up to 25%, while maintaining high accuracy compared to many state-of-the-art methods.
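The combination of binary-tree partitioning and bandit selection over a continuous weight space can be sketched as follows. This is an illustrative UCB-style stand-in on a 1-D interval with a simple split-when-well-explored rule; it is not the paper's `Tanbr` algorithm, which uses a neural bandit rather than interval means:

```python
# Sketch: tree-refined bandit search over a continuous weight space [0, 1].
# Each leaf covers a sub-interval; we play its midpoint, observe a reward,
# and split well-explored leaves so candidate weights become progressively finer.
import math
import random

class Leaf:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.n, self.mean = 0, 0.0  # visit count and running reward mean

    def midpoint(self):
        return (self.lo + self.hi) / 2

def ucb(leaf, t):
    """Upper confidence bound: optimistic score for unexplored leaves."""
    if leaf.n == 0:
        return float("inf")
    return leaf.mean + math.sqrt(2 * math.log(t) / leaf.n)

def tree_bandit(reward_fn, rounds=200, split_after=8, seed=0):
    random.seed(seed)
    leaves = [Leaf(0.0, 1.0)]
    for t in range(1, rounds + 1):
        leaf = max(leaves, key=lambda l: ucb(l, t))
        r = reward_fn(leaf.midpoint())
        leaf.n += 1
        leaf.mean += (r - leaf.mean) / leaf.n
        if leaf.n >= split_after:  # refine: replace leaf with two children
            leaves.remove(leaf)
            mid = leaf.midpoint()
            leaves += [Leaf(leaf.lo, mid), Leaf(mid, leaf.hi)]
    return max(leaves, key=lambda l: l.mean).midpoint()

# toy reward peaked at weight 0.7, with a little observation noise;
# `best` should land in the neighborhood of the peak
best = tree_bandit(lambda x: 1 - (x - 0.7) ** 2 + random.gauss(0, 0.01))
```

The tree is what makes a continuous decision space tractable for a bandit: early rounds compare coarse regions, later rounds drill into the promising ones, which is also what underlies the sublinear-regret argument in the abstract.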