Accurate Expert Predictions in MoE Inference via Cross-Layer Gate

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse Mixture-of-Experts (MoE) models face significant challenges in edge deployment, including inaccurate expert routing, high memory overhead, and elevated inference latency. This paper proposes an efficient sparse MoE inference framework tailored for edge scenarios. First, it introduces a novel cross-layer gating mechanism that enables zero-overhead, high-accuracy expert prefetching. Second, it designs a shallow-layer preference caching strategy achieving a 99% expert hit rate. Third, it integrates hierarchical quantization with GPU-CPU collaborative offloading scheduling to substantially improve I/O and cache efficiency. Experimental results demonstrate that, without compromising inference quality, the framework accelerates prefill and decoding phases by up to 4.5× and 4.1×, respectively, while exhibiting scalable performance under varying memory budgets.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they struggle with expert prediction: inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively reused for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and I/O efficiency. Experimental results show that, compared to Load-on-Demand and Expert-Activation-Path-based methods, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate's performance improvements are scalable across different memory budgets.
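The abstract's key insight is that the hidden state feeding layer i's gate can also be scored against layer i+1's gate ahead of time, so the experts layer i+1 will likely activate can be prefetched from CPU memory while layer i still computes. A minimal pure-Python sketch of that idea follows; all sizes and names (`NUM_EXPERTS`, `TOP_K`, `gate_weights`, `predict_next_experts`) are illustrative assumptions, not taken from the paper.

```python
import random

# Illustrative sketch of cross-layer gate prefetching (assumed structure,
# not Fate's actual implementation): reuse the current gate input to run
# the NEXT layer's gate early and pick the top-k experts to prefetch.

random.seed(0)
NUM_LAYERS, NUM_EXPERTS, HIDDEN, TOP_K = 4, 8, 16, 2

# One linear gate (a weight matrix of shape NUM_EXPERTS x HIDDEN) per layer.
gate_weights = [
    [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(NUM_EXPERTS)]
    for _ in range(NUM_LAYERS)
]

def gate_logits(layer, x):
    """Score hidden state x against every expert of `layer`'s gate."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights[layer]]

def predict_next_experts(layer, x):
    """Reuse layer `layer`'s gate input to run layer `layer + 1`'s gate
    early, returning the top-k expert indices to prefetch for that layer."""
    logits = gate_logits(layer + 1, x)
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)
    return ranked[:TOP_K]

x = [random.gauss(0, 1) for _ in range(HIDDEN)]
prefetch = predict_next_experts(0, x)  # experts to start fetching for layer 1
print(prefetch)
```

The extra cost is a single small gate matmul per layer, which is why the abstract can claim prefetching "without additional GPU overhead" in any meaningful sense; the expert weight transfer then overlaps with the current layer's computation.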
Problem

Research questions and friction points this paper is trying to address.

Improving MoE inference efficiency in memory-constrained edge deployments
Reducing inference delays caused by inaccurate expert predictions during offloading
Raising expert prefetching accuracy without extra GPU overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-layer gate for expert prefetching
Shallow-favoring expert caching strategy
Tailored quantization for cache optimization
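The shallow-favoring expert caching strategy listed above can be sketched as a capacity-bounded cache that, on eviction, drops experts from the deepest cached layer first, on the assumption that shallow-layer experts are reused more often. This is a hypothetical illustration of the stated policy; the class name and interface are assumptions, not the paper's API.

```python
# Illustrative sketch (assumed design, not Fate's code) of a cache that
# favors keeping shallow-layer experts resident in GPU memory.

class ShallowFavoringCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # (layer, expert_id) -> expert weights

    def put(self, layer, expert_id, weights):
        if (layer, expert_id) in self.store:
            return
        if len(self.store) >= self.capacity:
            # Evict the expert from the deepest layer currently cached,
            # so shallow-layer experts survive longer.
            victim = max(self.store, key=lambda key: key[0])
            del self.store[victim]
        self.store[(layer, expert_id)] = weights

    def get(self, layer, expert_id):
        return self.store.get((layer, expert_id))

cache = ShallowFavoringCache(capacity=2)
cache.put(0, 1, "expert-0-1")
cache.put(5, 3, "expert-5-3")
cache.put(1, 0, "expert-1-0")  # full: evicts the layer-5 expert
```

Under this policy the layer-5 expert is evicted while both shallow-layer experts remain cached, which is the behavior the 99% hit-rate claim relies on when shallow layers dominate reuse.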
Zhiyuan Fang
Sun Yat-sen University, Zhuhai, China
Zicong Hong
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
Blockchain · ML System · Edge/Cloud Computing
Yuegui Huang
Sun Yat-sen University, Guangzhou, China
Yufeng Lyu
Huawei Technologies Co. Ltd, Shenzhen, China
Wuhui Chen
Sun Yat-sen University, Zhuhai, China; Peng Cheng Laboratory, Shenzhen, China
Yue Yu
Peng Cheng Laboratory, Shenzhen, China
Fan Yu
Huawei Technologies Co. Ltd, Shenzhen, China
Zibin Zheng
IEEE Fellow, Highly Cited Researcher, Sun Yat-sen University, China
Blockchain · Smart Contract · Services Computing · Software Reliability