Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs

📅 2026-04-20

🤖 AI Summary
This work addresses the inefficiency of Mixture-of-Experts (MoE) large language models on the Apple Neural Engine (ANE), where dynamic routing, irregular operators, and the scheduling overhead of many small expert kernels hinder effective NPU utilization. The authors propose NPUMoE, a runtime engine for efficient MoE inference on Apple Silicon NPUs. It combines three techniques driven by offline calibration: static expert capacity tiering, grouped expert execution, and load-aware compute graph residency. Dynamic logic stays on the CPU/GPU while static, dense computation is offloaded to the NPU, which works around the NPU's limited support for dynamic sparse computation. Evaluated on Apple M-series chips, NPUMoE achieves 1.32–5.55× lower end-to-end latency, 1.81–7.37× higher energy efficiency, and 1.78–5.54× fewer CPU cycles than baseline implementations.

📝 Abstract
The Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, such as top-k and scatter/gather, are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from the CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to the NPU while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) static tiers for expert capacity, to address dynamic expert routing; (2) grouped expert execution, to mitigate NPU concurrency limits; and (3) load-aware expert compute graph residency, to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
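
To make the idea of static capacity tiers concrete, here is a minimal Python sketch of how a runtime might map a variable per-expert token count onto a small set of fixed shapes. It is not the authors' implementation. The tier values, the function names `pick_tier` and `plan_experts`, the top-1 routing, and the handling of overflow (marking the expert for a dynamic-shape fallback) are all assumptions made for illustration; the abstract only states that tiers are derived from offline calibration.

```python
from bisect import bisect_left

# Hypothetical capacity tiers (tokens per expert). The paper derives its tiers
# from offline calibration; these values are illustrative only.
TIERS = [16, 64, 256]


def pick_tier(num_tokens, tiers=TIERS):
    """Return the smallest tier that fits num_tokens, or None on overflow."""
    i = bisect_left(tiers, num_tokens)
    return tiers[i] if i < len(tiers) else None


def plan_experts(expert_of_token, num_experts, tiers=TIERS):
    """Bucket tokens by expert (top-1 routing assumed), then map each
    active expert's load to a static shape drawn from the tier list."""
    buckets = [[] for _ in range(num_experts)]
    for token_idx, expert in enumerate(expert_of_token):
        buckets[expert].append(token_idx)

    plan = {}
    for expert, tokens in enumerate(buckets):
        if not tokens:
            continue  # inactive expert: nothing to run
        tier = pick_tier(len(tokens), tiers)
        plan[expert] = {
            "tokens": tokens,
            # None means no tier fits; this sketch assumes such experts
            # would take a dynamic-shape fallback path.
            "tier": tier,
            "padding": 0 if tier is None else tier - len(tokens),
        }
    return plan
```

With tiers `[2, 4]` and routing `[0, 0, 2, 0]`, expert 0 receives three tokens and is padded to the 4-token tier, expert 2 receives one token and is padded to the 2-token tier, and expert 1 is skipped. The trade-off this illustrates is that fewer tiers mean fewer precompiled shapes but more wasted padding.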
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
LLM inference
Apple Silicon NPU
dynamic tensor shapes
expert routing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
NPU offloading
Apple Silicon
static expert tiers
grouped expert execution