Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

📅 2026-04-20

🤖 AI Summary

This work addresses the inefficiency of Mixture-of-Experts (MoE) large language models on the Apple Neural Engine (ANE), where dynamic routing, irregular operators, and the scheduling overhead of many small expert kernels hinder effective NPU utilization. The authors propose NPUMoE, a runtime engine that enables, for the first time, highly efficient MoE inference on Apple Silicon NPUs. It relies on three techniques guided by offline calibration: static expert capacity tiering, grouped expert execution, and load-aware compute graph residency. Dynamic logic stays on the CPU/GPU while static dense computation runs on the NPU, which works around the NPU's limited support for dynamic sparse computation. Evaluated on Apple M-series chips, NPUMoE achieves 1.32–5.55× lower end-to-end latency, 1.81–7.37× higher energy efficiency, and 1.78–5.54× fewer CPU cycles than baseline implementations.
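The split between dynamic logic and static dense computation can be illustrated with a small sketch of top-k routing, the kind of irregular step the summary says stays on the CPU/GPU. This is an illustration only, not NPUMoE's code: the function name `route_top_k` and the plain-list data layout are assumptions made for the example.

```python
from collections import defaultdict

def route_top_k(router_scores, k):
    """Group token indices by expert for top-k routing (hypothetical helper).

    router_scores: one list of per-expert scores for each token.
    Returns {expert_id: [token indices]}. The list lengths change from
    batch to batch, which is the dynamic shape a fixed-shape accelerator
    graph cannot consume directly.
    """
    assignments = defaultdict(list)
    for token_idx, scores in enumerate(router_scores):
        # Rank experts by descending score; ties go to the lower expert id.
        ranked = sorted(range(len(scores)), key=lambda e: (-scores[e], e))
        for expert_id in ranked[:k]:
            assignments[expert_id].append(token_idx)
    return dict(assignments)

# Three tokens, three experts, top-2 routing.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]
groups = route_top_k(scores, k=2)
print(groups)
```

In this toy batch the three experts receive two, three, and one token respectively, so each expert's input would have a different shape unless it is padded or bucketed before being handed to a statically compiled graph.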

📝 Abstract

The Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k and scatter/gather, are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from the CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to the NPU while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
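To make technique (1) more concrete, the sketch below shows one way static capacity tiers could turn a variable per-expert token count into a small set of fixed shapes. It is a guess at the general idea, not the paper's algorithm: the quantile-based tier selection, the `None` return used to signal CPU/GPU fallback, and the names `build_tiers`, `select_tier`, and `pad_to_tier` are all assumptions made for illustration.

```python
import math

def build_tiers(calibration_counts, num_tiers=3):
    """Derive ascending capacity tiers from token counts observed offline.

    Assumption: tiers are evenly spaced quantiles of the calibration data,
    with the last tier equal to the largest observed count.
    """
    ordered = sorted(calibration_counts)
    tiers = []
    for i in range(1, num_tiers + 1):
        idx = math.ceil(i * len(ordered) / num_tiers) - 1
        tiers.append(ordered[idx])
    return sorted(set(tiers))

def select_tier(token_count, tiers):
    """Return the smallest tier that fits, or None to signal fallback."""
    for capacity in tiers:
        if token_count <= capacity:
            return capacity
    return None  # assumed to mean: run this expert on the CPU/GPU path

def pad_to_tier(token_ids, capacity, pad_id=-1):
    """Pad a token list to the tier's fixed length with a sentinel id."""
    return token_ids + [pad_id] * (capacity - len(token_ids))

# Hypothetical per-expert token counts seen during calibration.
calibration_counts = [3, 8, 5, 16, 4, 12]
tiers = build_tiers(calibration_counts)      # [4, 8, 16]

tokens_for_expert = [10, 11, 12, 13, 14]     # 5 tokens routed at runtime
capacity = select_tier(len(tokens_for_expert), tiers)
padded = pad_to_tier(tokens_for_expert, capacity)
print(tiers, capacity, padded)
```

With a handful of tiers, each expert needs only a few precompiled fixed-shape graphs. The trade-off in this sketch is padding waste when a count falls just above a tier boundary, set against the cost of compiling and keeping more graphs resident.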

Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts

LLM inference

Apple Silicon NPU

dynamic tensor shapes

expert routing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

NPU offloading

Apple Silicon