SpecMoE: A Fast and Efficient Mixture-of-Experts Inference via Self-Assisted Speculative Decoding

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high memory footprint and low parameter efficiency of Mixture-of-Experts (MoE) models during inference, which hinder deployment, especially in large-batch or memory-constrained settings. The authors propose SpecMoE, a system that, for the first time, effectively integrates self-assisted speculative decoding into MoE inference without requiring any additional training or fine-tuning. By combining CPU offloading with an expert selection mechanism, SpecMoE co-optimizes memory utilization and interconnect bandwidth. On memory-constrained systems, it achieves up to a 4.30× throughput speedup while significantly reducing memory and interconnect bandwidth requirements.

📝 Abstract

The Mixture-of-Experts (MoE) architecture has emerged as a promising approach to mitigate the rising computational costs of large language models (LLMs) by selectively activating parameters. However, its high memory requirements and sub-optimal parameter efficiency pose significant challenges for efficient deployment. Although CPU-offloaded MoE inference systems have been proposed in the literature, they offer limited efficiency, particularly for large batch sizes. In this work, we propose SpecMoE, a memory-efficient MoE inference system based on our self-assisted speculative decoding algorithm. SpecMoE demonstrates the effectiveness of applying speculative decoding to MoE inference without requiring additional model training or fine-tuning. Our system improves inference throughput by up to $4.30\times$, while significantly reducing bandwidth requirements of both memory and interconnect on memory-constrained systems.
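
For readers unfamiliar with speculative decoding, the following is a minimal sketch of the generic greedy draft-and-verify loop, not the SpecMoE algorithm itself. The abstract does not say how the self-assisted draft is produced, so the draft here is a stand-in. All names (`speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`, `draft_len`) are hypothetical, and the toy "models" are plain functions over integer tokens.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_decode(model: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       max_new_tokens: int, draft_len: int = 4) -> List[Token]:
    """Greedy draft-and-verify; the output matches greedy_decode(target, ...)."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) Draft: a cheap model proposes a short run of tokens.
        proposal: List[Token] = []
        for _ in range(min(draft_len, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) Verify: keep target tokens until the first disagreement.
        #    (A real system scores all proposed positions in one batched
        #    forward pass; this sketch calls the target per position.)
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if expected != guess:
                break  # discard the rest of the proposal
    return tokens


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except after tokens where last % 3 == 2.
    return 0 if ctx[-1] % 3 == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], max_new_tokens=8))
```

Because every emitted token comes from the target model, the draft only affects how many positions can be confirmed per verification step, not the output. That property is what lets speculative decoding be applied without retraining the target model.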

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

inference efficiency

memory constraints

large language models

computational cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

speculative decoding

memory-efficient inference