When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

📅 2026-05-22
🤖 AI Summary
This study addresses the lack of systematic analysis of neural processing unit (NPU) effectiveness at the operator and pipeline levels in on-device large language model (LLM) inference. The authors propose an OPMASK-based controlled pipeline decomposition that isolates communication, quantization, and computation overheads within the NPU execution path, and they apply it in a stage-aware, multi-level benchmarking study on a CPU-NPU heterogeneous SoC. The results show that the CPU is up to 1.6× faster than the NPU in the compute-intensive Prefill stage, while the NPU yields only modest speedups of 1.05–1.2× in the memory-bound Decode stage. Scheduling overhead and cross-backend fallbacks further reduce the practical benefit of NPU offloading, and energy consumption rises by up to 51% as more computation is offloaded to the NPU. Based on these findings, the authors derive design guidelines for NPU architects targeting on-device LLM inference.
📝 Abstract
Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (by up to 1.6×), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05–1.2×). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. Energy follows the same trend: increasing NPU offloading raises energy consumption by up to 51%. Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.
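To make the stage-level reversal concrete, the following is a minimal arithmetic sketch, not the authors' methodology. It applies the reported upper-bound ratios (NPU Prefill 1.6× slower than CPU, NPU Decode 1.2× faster) to a hypothetical CPU baseline and computes how many generated tokens an NPU-only run would need before its Decode savings offset its Prefill penalty. The baseline timings, the `StageTimes` type, and all function names are assumptions chosen for illustration. The model covers latency only and ignores the scheduling and fallback overheads the abstract mentions.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    prefill_s: float           # time to process the whole prompt, in seconds
    decode_s_per_token: float  # time to generate one output token, in seconds


def total_latency(t: StageTimes, new_tokens: int) -> float:
    """End-to-end latency: one Prefill pass plus one Decode step per token."""
    return t.prefill_s + new_tokens * t.decode_s_per_token


def npu_from_cpu(cpu: StageTimes,
                 prefill_slowdown: float = 1.6,
                 decode_speedup: float = 1.2) -> StageTimes:
    """Derive NPU stage times from a CPU baseline using the reported ratios.

    The defaults are the upper bounds quoted in the abstract; real ratios
    vary with model and workload.
    """
    return StageTimes(
        prefill_s=cpu.prefill_s * prefill_slowdown,
        decode_s_per_token=cpu.decode_s_per_token / decode_speedup,
    )


def break_even_tokens(cpu: StageTimes, npu: StageTimes) -> float:
    """Generated tokens at which NPU-only latency equals CPU-only latency."""
    extra_prefill = npu.prefill_s - cpu.prefill_s
    saved_per_token = cpu.decode_s_per_token - npu.decode_s_per_token
    if saved_per_token <= 0:
        return float("inf")  # the NPU never catches up
    return extra_prefill / saved_per_token


if __name__ == "__main__":
    # Hypothetical CPU baseline: 1.0 s Prefill, 50 ms per decoded token.
    cpu = StageTimes(prefill_s=1.0, decode_s_per_token=0.05)
    npu = npu_from_cpu(cpu)
    print(f"break-even at about {break_even_tokens(cpu, npu):.0f} tokens")
    for n in (16, 64, 256):
        print(n, round(total_latency(cpu, n), 3), round(total_latency(npu, n), 3))
```

Under these assumed numbers, short generations favor the CPU and only long generations let the NPU recover its Prefill cost, which is one way to read the "not always faster" claim. Energy is a separate axis that this sketch does not model.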
Problem

Research questions and friction points this paper is trying to address.

NPU
mobile LLM inference
heterogeneous execution
stage-level performance
energy consumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

stage-level analysis
OPMASK-based decomposition
heterogeneous SoC
NPU offloading
mobile LLM inference