Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
When Mixture-of-Experts (MoE) large language models are deployed on memory-constrained devices, the benefit of expert offloading depends on local routing consistency, i.e., the degree to which consecutive tokens activate similar experts. This property varies across models and remains understudied, which makes cache efficiency hard to predict. Method: We introduce two quantitative metrics, Segment Routing Best Performance (SRP) and Segment Cache Best Hit Rate (SCH), and use them to analyze routing behavior across 20 MoE LLMs with diverse sizes and architectures. SRP evaluates how well a fixed group of experts can cover the needs of a segment of tokens; SCH measures the optimal segment-level cache hit rate under a given cache size limit. Contribution/Results: We find that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency, and that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones. Most models can balance cache effectiveness and efficiency with a cache size of approximately 2x the number of active experts. These findings inform memory-efficient MoE design and deployment without compromising inference speed, and the code for replicating the experiments is publicly available.

📝 Abstract
Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce *expert offloading* that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) **Segment Routing Best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) **Segment Cache Best Hit Rate (SCH)**, which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .
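To make the idea of a segment-level cache hit rate concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's exact SCH definition: it assumes the cache is held fixed for the whole segment and that every expert activation counts equally. Under those assumptions, keeping the most frequently activated experts is the best possible choice. The function names `best_fixed_cache_hit_rate` and `mean_segment_hit_rate` and the toy routing trace are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set


def best_fixed_cache_hit_rate(segment: Sequence[Set[int]], cache_size: int) -> float:
    """Best hit rate of one fixed expert cache over a segment of tokens.

    `segment` holds, for each token, the set of expert ids the router activated.
    Assumption (not from the paper): the cache does not change inside the
    segment, so its hits are the summed activation counts of its members and
    the `cache_size` most frequently activated experts maximize the hit rate.
    """
    counts = Counter(expert for token_experts in segment for expert in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(count for _, count in counts.most_common(cache_size))
    return hits / total


def mean_segment_hit_rate(
    routing: Sequence[Set[int]], segment_len: int, cache_size: int
) -> float:
    """Average the per-segment best hit rate over consecutive segments."""
    segments: List[Sequence[Set[int]]] = [
        routing[i : i + segment_len] for i in range(0, len(routing), segment_len)
    ]
    if not segments:
        return 0.0
    rates = [best_fixed_cache_hit_rate(seg, cache_size) for seg in segments]
    return sum(rates) / len(rates)


# Toy trace for one layer: 4 tokens, 2 active experts per token (made-up data).
routing = [{0, 1}, {0, 2}, {0, 1}, {3, 1}]
active_experts = 2

print(best_fixed_cache_hit_rate(routing, cache_size=active_experts))      # 0.75
print(best_fixed_cache_hit_rate(routing, cache_size=2 * active_experts))  # 1.0
```

In this toy trace, a cache equal to the number of active experts covers 6 of 8 activations, while doubling it covers all of them. Real traces would come from a model's router outputs, and the hit rate would typically be averaged over layers and many segments.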
Problem

Research questions and friction points this paper is trying to address.

Measures local routing consistency in MoE models
Evaluates expert offloading efficiency for memory-constrained devices
Identifies optimal cache size for balancing effectiveness and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes SRP and SCH metrics for MoE routing consistency
Analyzes 20 MoE LLMs for local routing patterns
Recommends a cache size of about 2x the number of active experts to balance effectiveness and efficiency
Authors

Jingcong Liang, Fudan University (Computational Argumentation, Large Language Model)
Siyuan Wang, University of Southern California
Miren Tian, Huawei Technologies Ltd.
Yitong Li, Huawei Technologies Ltd.
Duyu Tang, Huawei (Natural Language Processing)
Zhongyu Wei, Fudan University