🤖 AI Summary
This work studies how to deploy Mixture-of-Experts (MoE) architectures in on-device language models with sub-billion active parameters, where mobile memory and compute budgets are tight. The authors propose MobileMoE, a family of on-device MoE models with 0.3–0.9B active parameters and 1.3–5.3B total parameters. They formulate an on-device MoE scaling law that jointly optimizes the architecture under mobile memory and compute constraints, and use it to identify a “sweet spot”: moderate sparsity with fine-grained and shared experts. The models are trained with a four-stage recipe (pre-training, mid-training, instruction fine-tuning, and quantization-aware training) on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense models with 2–4× fewer inference FLOPs, and matches or surpasses OLMoE-1B-7B with up to 60% fewer parameters. The authors also report the first efficient MoE inference on commodity smartphones: at comparable INT4 weight memory, MobileMoE-S achieves 1.8–3.8× faster prefill and 2.2–3.4× faster decoding than the dense baseline MobileLLM-Pro.
📝 Abstract
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot (moderate sparsity with fine-grained and shared experts) that is simultaneously memory- and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with $2$-$4\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.
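The distinction between active and total parameters, and its link to INT4 weight memory, can be made concrete with a little arithmetic. The sketch below is illustrative only: the `MoEConfig` fields, the SwiGLU-style three-matrix expert, and every number in `example` are assumptions chosen for readability, not MobileMoE's actual configuration. Embeddings, norms, and quantization scales are ignored.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical fields; none of these values come from the paper.
    d_model: int                # residual stream width
    n_layers: int               # number of transformer blocks
    d_expert: int               # hidden width of one fine-grained expert
    n_routed_experts: int       # experts the router chooses among
    top_k: int                  # routed experts activated per token
    n_shared_experts: int       # experts applied to every token
    attn_params_per_layer: int  # attention weights per block


def expert_params(cfg: MoEConfig) -> int:
    # Assumes a gated FFN with up, gate, and down projections.
    return 3 * cfg.d_model * cfg.d_expert


def router_params(cfg: MoEConfig) -> int:
    # One linear layer scoring each routed expert.
    return cfg.d_model * cfg.n_routed_experts


def total_params(cfg: MoEConfig) -> int:
    # Every expert must be stored, so all of them count toward memory.
    n_experts = cfg.n_routed_experts + cfg.n_shared_experts
    per_layer = (cfg.attn_params_per_layer + router_params(cfg)
                 + n_experts * expert_params(cfg))
    return cfg.n_layers * per_layer


def active_params(cfg: MoEConfig) -> int:
    # Only top_k routed experts plus the shared experts run per token.
    n_active = cfg.top_k + cfg.n_shared_experts
    per_layer = (cfg.attn_params_per_layer + router_params(cfg)
                 + n_active * expert_params(cfg))
    return cfg.n_layers * per_layer


def int4_weight_bytes(n_params: int) -> int:
    # 4 bits per weight, i.e. two weights per byte (scales ignored).
    return n_params // 2


example = MoEConfig(
    d_model=1024, n_layers=16, d_expert=512,
    n_routed_experts=32, top_k=4, n_shared_experts=1,
    attn_params_per_layer=4 * 1024 * 1024,
)

if __name__ == "__main__":
    total, active = total_params(example), active_params(example)
    print(f"total params : {total / 1e9:.2f}B")
    print(f"active params: {active / 1e9:.2f}B")
    print(f"total/active : {total / active:.2f}x")
    print(f"INT4 weights : {int4_weight_bytes(total) / 1e9:.2f} GB")
```

In this toy setting, total parameters set the weight memory footprint, while active parameters set per-token compute. That is why an MoE model can be compared with a dense model at comparable INT4 weight memory and still need fewer inference FLOPs.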