Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding

📅 2025-08-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low hardware utilization and hard-to-hide I/O latency of memory-constrained Mixture-of-Experts (MoE) inference with offloading, this paper proposes a speculative-decoding-based GPU-CPU collaborative offloading framework. The method uses a lightweight draft model to enlarge each expert's computation load (the first application of speculative decoding to MoE offloading, per the authors) and designs a CPU-side chunked attention verification kernel to reduce verification overhead. System-level orchestration is driven by theoretical and empirical Roofline analysis together with an automated hyperparameter tuner. Experiments demonstrate that, while preserving model accuracy, the approach achieves up to 2.5× higher decode throughput than state-of-the-art MoE offloading schemes, substantially improving end-to-end hardware utilization and inference efficiency.
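
As a concrete illustration of the mechanism the summary describes, here is a minimal speculative-decoding loop in the standard draft-then-verify style: a cheap draft model proposes k tokens, and the offloaded MoE target verifies all of them in one batched forward pass, so each round of expert-weight transfers serves k tokens instead of one. The `draft_model` and `target_moe` callables are hypothetical placeholders (batch size 1 assumed); this sketches the general technique, not SpecMoEOff's implementation.

```python
import torch

def speculative_decode_step(draft_model, target_moe, prefix, k=4):
    """One speculative step: draft k tokens cheaply, then verify them
    with a single batched pass through the offloaded MoE target model.

    Batching k tokens per verification amortizes the CPU->GPU expert
    weight transfers that dominate offloaded MoE decoding. `draft_model`
    and `target_moe` are stand-ins returning (batch, seq, vocab) logits;
    KV-cache reuse in the draft loop is omitted for brevity.
    """
    # 1) Autoregressively draft k candidate tokens with the small model.
    draft_tokens, draft_probs = [], []
    ctx = prefix
    for _ in range(k):
        logits = draft_model(ctx)[..., -1, :]        # last-position logits
        p = torch.softmax(logits, dim=-1)
        t = torch.multinomial(p, 1)
        draft_tokens.append(t)
        draft_probs.append(p)
        ctx = torch.cat([ctx, t], dim=-1)

    # 2) Verify all k drafts in ONE target forward pass; each expert's
    #    weights are fetched once and reused across the k positions.
    target_logits = target_moe(ctx)[..., -k - 1:-1, :]
    target_p = torch.softmax(target_logits, dim=-1)

    # 3) Standard accept/reject: keep the longest verified prefix.
    #    (The bonus token sampled when all k drafts pass is omitted.)
    accepted = []
    for i, t in enumerate(draft_tokens):
        q = draft_probs[i].gather(-1, t)
        p = target_p[..., i, :].gather(-1, t)
        if torch.rand(1) < (p / q).clamp(max=1.0):
            accepted.append(t)
        else:
            # Resample from the residual distribution and stop.
            resid = (target_p[..., i, :] - draft_probs[i]).clamp(min=0)
            accepted.append(torch.multinomial(resid / resid.sum(), 1))
            break
    return torch.cat([prefix] + accepted, dim=-1)
```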

📝 Abstract
Recent advancements in Mixture of Experts (MoE) models have significantly increased their parameter scale as well as model performance. Extensive offloading techniques have been proposed to address the GPU memory limitations of MoE inference. However, due to the I/O bottleneck and sparse computation of MoE models, existing offloading techniques still suffer from low hardware utilization. To fully utilize hardware resources, we propose SpecMoEOff, which employs speculative decoding to enlarge the workload of each expert. SpecMoEOff orchestrates the GPU and CPU through both theoretical and empirical roofline analysis. In addition, we develop a dedicated CPU chunked attention verification kernel to fit speculative decoding to offloading scenarios while minimizing the additional overhead introduced by draft models. SpecMoEOff further integrates an optimizer to automatically tune the hyperparameters of speculative decoding for a given hardware and workload. Experimental results show that SpecMoEOff achieves up to 2.5x decode throughput improvement over state-of-the-art MoE offloading techniques.
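
The roofline orchestration mentioned in the abstract can be illustrated with back-of-the-envelope arithmetic: offloaded MoE decoding is dominated by the roughly fixed cost of streaming expert weights over PCIe, so verifying more draft tokens per transfer raises throughput until drafting overhead and falling acceptance catch up. Below is a toy model in that spirit; all functions, parameters, and numbers are illustrative assumptions, not the paper's actual optimizer.

```python
def best_speculation_length(expert_bytes_per_step, pcie_gbps,
                            flops_per_token, gpu_tflops,
                            draft_ms_per_token, accept_rate, k_max=16):
    """Toy roofline model for choosing the speculation length k.

    Each decode step pays a roughly constant I/O cost to stream expert
    weights from CPU to GPU; verification compute grows with k and
    overlaps with the I/O; drafting adds a serial cost.
    """
    io_time = expert_bytes_per_step / (pcie_gbps * 1e9)          # seconds
    best = (1, 0.0)
    for k in range(1, k_max + 1):
        verify_time = k * flops_per_token / (gpu_tflops * 1e12)  # seconds
        draft_time = k * draft_ms_per_token / 1e3                # seconds
        step_time = max(io_time, verify_time) + draft_time
        # Expected tokens emitted per step under a geometric acceptance
        # model: k drafts plus one corrected/bonus token.
        expected = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
        tput = expected / step_time
        if tput > best[1]:
            best = (k, tput)
    return best

# Illustrative numbers only: 8 GB of expert weights touched per step,
# 25 GB/s PCIe, 2 GFLOPs per token, 100 TFLOPS GPU, 10 ms per draft
# token, 80% acceptance rate.
k, tput = best_speculation_length(8e9, 25, 2e9, 100, 10, 0.8)
print(f"choose k={k}, projected ~{tput:.1f} tokens/s")
```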
Problem

Research questions and friction points this paper is trying to address.

Overcoming low hardware utilization in MoE offloading inference
Addressing I/O bottlenecks in sparse MoE computation
Minimizing overhead from draft models in speculative decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses speculative decoding to hide offloading latency
Develops CPU chunked attention verification kernel (see the sketch after this list)
Integrates optimizer for automatic hyperparameter tuning
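
The chunked attention verification kernel in the list above can be sketched in its generic form: verifying k draft tokens on the CPU means scoring k query positions against a long KV cache, and walking the cache in fixed-size chunks with an online softmax keeps the working set cache-resident without materializing the full score matrix. A NumPy illustration of that general technique follows (causal masking among the draft positions is omitted for brevity); names and shapes are assumptions, not the paper's kernel.

```python
import numpy as np

def chunked_attention(q, k_cache, v_cache, chunk=256):
    """Attention over a long KV cache, processed chunk by chunk with an
    online (streaming) softmax so no full score matrix is materialized.

    q:        (n_q, d)   query vectors for the draft tokens being verified
    k_cache:  (n_kv, d)  cached keys resident in CPU memory
    v_cache:  (n_kv, d)  cached values resident in CPU memory
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n_q, -np.inf)        # running max of scores per query
    l = np.zeros(n_q)                # running softmax denominator
    acc = np.zeros((n_q, v_cache.shape[1]))

    for s in range(0, k_cache.shape[0], chunk):
        kc, vc = k_cache[s:s + chunk], v_cache[s:s + chunk]
        scores = (q @ kc.T) * scale                  # (n_q, chunk)
        m_new = np.maximum(m, scores.max(axis=1))
        # Rescale previous partial sums to the new max, then accumulate.
        alpha = np.exp(m - m_new)
        p = np.exp(scores - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        acc = acc * alpha[:, None] + p @ vc
        m = m_new
    return acc / l[:, None]
```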
👥 Authors

Zhibin Wang
Zhejiang University
new particle formation, aerosols, hygroscopicity, black carbon

Zhonghui Zhang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Yuhang Zhou
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Zibo Wang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Mo Zhou
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Peng Jiang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Weilin Cai
The Hong Kong University of Science and Technology (Guangzhou)
Machine Learning Systems, High Performance Computing, Artificial Intelligence

Chengying Huan
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China

Rong Gu
Mälardalen University
Formal Methods, Machine Learning, Autonomous Systems

Sheng Zhong
Nanjing University
computer networks, security and privacy, theory of computing

Chen Tian
Professor, Nanjing University
Data Center Networking, Network Function Virtualisation, Content Distribution