When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This study addresses the lack of systematic analysis of neural processing unit (NPU) efficiency at the operator and pipeline levels in on-device large language model (LLM) inference. The authors propose an OPMASK-based controlled pipeline decomposition that isolates communication, quantization, and computation overheads within the NPU execution path, and apply it in a stage-aware, multi-level benchmarking study on a CPU-NPU heterogeneous SoC. The results show a stage-level performance reversal: the CPU is up to 1.6× faster than the NPU in the compute-intensive Prefill stage, while the NPU yields only a 1.05–1.2× speedup in the memory-bound Decode stage. Scheduling overhead and cross-backend fallback further reduce the practical benefit of NPU offloading, and energy consumption rises with the degree of offloading, by up to 51%. From these findings, the authors derive design guidelines for NPU architects targeting on-device LLM inference.

📝 Abstract

Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline levels. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition methodology that isolates communication, quantization, and computation overheads within the NPU execution path. Our results reveal a counter-intuitive stage-level performance reversal: CPUs outperform NPUs in the compute-intensive Prefill stage (by up to 1.6×), while NPUs provide only limited acceleration in the memory-bound Decode stage (1.05–1.2×). We further show that scheduling overhead and cross-backend fallback reduce the practical benefits of NPU offloading. Energy consumption also rises as more computation is offloaded to the NPU, by up to 51%. Based on these findings, we derive design guidelines for NPU architects targeting on-device LLM inference.
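The abstract does not spell out how OPMASK works, so the following is only a minimal sketch of one way a mask-driven decomposition could be set up: enable the stages of the NPU path cumulatively and attribute the latency difference between consecutive runs to the stage that was just enabled. The `OpMask` flags, the `decompose` function, and the latency numbers are all hypothetical illustrations, not the authors' implementation or measurements.

```python
from enum import IntFlag


class OpMask(IntFlag):
    """Hypothetical bitmask selecting which stages of the NPU path run."""
    NONE = 0
    COMM = 1      # host <-> NPU data transfer
    QUANT = 2     # quantize / dequantize around the NPU call
    COMPUTE = 4   # the NPU kernel itself


def decompose(latency_ms):
    """Attribute latency to each stage by differencing cumulative runs.

    latency_ms maps an OpMask value to the latency measured with exactly
    those stages enabled. This assumes stage costs are additive, which is
    a simplification.
    """
    comm = OpMask.COMM
    comm_quant = OpMask.COMM | OpMask.QUANT
    full = OpMask.COMM | OpMask.QUANT | OpMask.COMPUTE
    return {
        "communication": latency_ms[comm] - latency_ms[OpMask.NONE],
        "quantization": latency_ms[comm_quant] - latency_ms[comm],
        "computation": latency_ms[full] - latency_ms[comm_quant],
    }


# Placeholder latencies in milliseconds (made up for illustration only).
measured = {
    OpMask.NONE: 2,
    OpMask.COMM: 5,
    OpMask.COMM | OpMask.QUANT: 9,
    OpMask.COMM | OpMask.QUANT | OpMask.COMPUTE: 20,
}
breakdown = decompose(measured)
```

With these placeholder inputs, `breakdown` assigns 3 ms to communication, 4 ms to quantization, and 11 ms to computation. A real study would also need repeated runs and a check that the additivity assumption holds on the target SoC.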

Problem

Research questions and friction points this paper is trying to address.

NPU

mobile LLM inference

heterogeneous execution

stage-level performance

energy consumption

Innovation

Methods, ideas, or system contributions that make the work stand out.

stage-level analysis

OPMASK-based decomposition

heterogeneous SoC