🤖 AI Summary
This work addresses the high memory footprint and sub-optimal parameter efficiency of Mixture-of-Experts (MoE) models during inference, which hinder deployment, especially in large-batch or memory-constrained settings. The authors propose SpecMoE, a memory-efficient inference system that applies self-assisted speculative decoding to MoE inference without requiring any additional training or fine-tuning. By combining CPU offloading with an expert selection mechanism, SpecMoE reduces the bandwidth demands on both memory and interconnect. The system achieves up to a 4.30× throughput speedup on memory-constrained systems.
📝 Abstract
The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
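For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic draft-then-verify loop in its greedy form. It is not SpecMoE's algorithm: the abstract does not describe how the self-assisted draft is produced, so `toy_draft` and `toy_target` are hypothetical stand-ins, and every name in the block is an assumption made for illustration. In a real system the verification step would be one batched forward pass of the target model over all drafted positions, which is where the throughput gain comes from.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposal: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase: compare each drafted token with the target's greedy choice.
    # (Sequential here for clarity; batched in a real system.)
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: emit the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All k drafts accepted: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new_tokens: int) -> List[Token]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prompt) + max_new_tokens]


# Hypothetical toy models: the target counts modulo 10; the draft agrees
# except after token 4, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy verification, the output is identical to what the target model would produce on its own. Draft quality affects only how many tokens are accepted per step, not the result.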