Scaling LLM Test-Time Compute with Mobile NPU on Smartphones

📅 2025-09-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the trade-off between the low accuracy of small language models and the high computational overhead of large language models (LLMs) on mobile devices, this paper proposes a test-time scaling method tailored for smartphone NPUs. Methodologically, it introduces a hardware-aware, block-wise mixed-precision quantization scheme, replaces high-cost operators such as Softmax with LUT-based approximations, and employs grouped quantization to align with NPU memory access patterns. Furthermore, it develops an end-to-end inference system optimized for Qualcomm Snapdragon NPU architectures. Experimental results demonstrate a 19.0× speedup for mixed-precision GEMM and a 2.2× acceleration for Softmax. Critically, the scaled small model achieves accuracy on par with, or even surpassing, that of baseline large models across multiple tasks, significantly advancing the performance-versus-cost Pareto frontier. This work enables efficient, low-cost deployment of LLMs on resource-constrained mobile platforms.
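As a rough illustration of the grouped (block-wise) quantization the summary describes, the sketch below quantizes weights in fixed-size groups with one scale per group. The group size, int4 value range, and NumPy storage types are illustrative assumptions; the paper's actual tile shapes are chosen to match NPU memory access patterns and are not reproduced here.

```python
import numpy as np

def quantize_grouped(weights, group_size=32):
    """Quantize a 1-D fp32 weight array to int4 with one fp16 scale per group.

    Hypothetical sketch of grouped quantization; group_size=32 is an
    illustrative choice, not the paper's NPU-aligned tile size.
    """
    w = weights.reshape(-1, group_size)
    # Symmetric int4: representable values are [-8, 7], so scale by max/7.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_grouped(q, scales):
    """Reconstruct approximate fp32 weights from int4 values and fp16 scales."""
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)
```

With one scale per small group, the quantization error stays bounded by half a quantization step of that group's own dynamic range, which is the usual motivation for grouping over per-tensor scales.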

📝 Abstract
Deploying Large Language Models (LLMs) on mobile devices faces the challenge of insufficient performance in smaller models and excessive resource consumption in larger ones. This paper highlights that mobile Neural Processing Units (NPUs) have underutilized computational resources, particularly their matrix multiplication units, during typical LLM inference. To leverage this wasted compute capacity, we propose applying parallel test-time scaling techniques on mobile NPUs to enhance the performance of smaller LLMs. However, this approach confronts inherent NPU challenges, including inadequate hardware support for fine-grained quantization and low efficiency in general-purpose computations. To overcome these, we introduce two key techniques: a hardware-aware tile quantization scheme that aligns group quantization with NPU memory access patterns, and efficient LUT-based replacements for complex operations such as Softmax and dequantization. We design and implement an end-to-end inference system that leverages the NPU's compute capability to support test-time scaling on Qualcomm Snapdragon platforms. Experiments show our approach brings significant speedups: up to 19.0× for mixed-precision GEMM and 2.2× for Softmax. More importantly, we demonstrate that smaller models using test-time scaling can match or exceed the accuracy of larger models, achieving a new performance-cost Pareto frontier.
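The mixed-precision GEMM the abstract benchmarks pairs low-bit integer weights with higher-precision activations. The NumPy reference below shows only the arithmetic of a W4A16-style layer (int4 weights dequantized per group, multiplied with fp16 activations); it is purely illustrative, since the paper's NPU kernel fuses dequantization into the matrix-multiply rather than materializing fp32 weights as done here.

```python
import numpy as np

def mixed_precision_gemm(act_fp16, w_q, scales, group_size=32):
    """Reference (not performant) W4A16-style GEMM.

    act_fp16 : (batch, in_dim) fp16 activations
    w_q      : (out_dim, in_dim) int4 weight values stored as int8
    scales   : (out_dim, in_dim // group_size) fp16 per-group scales

    All shapes and the group size are illustrative assumptions.
    """
    out_dim, in_dim = w_q.shape
    # Dequantize group-wise: broadcast one scale over each group of weights.
    w = w_q.astype(np.float32).reshape(out_dim, in_dim // group_size, group_size)
    w = w * scales[:, :, None].astype(np.float32)
    # fp32 accumulation of fp16 activations against dequantized weights.
    return act_fp16.astype(np.float32) @ w.reshape(out_dim, in_dim).T
```

A fused kernel avoids ever writing the dequantized weight matrix to memory, which is where most of the reported GEMM speedup on the NPU would come from.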
Problem

Research questions and friction points this paper is trying to address.

Deploying LLMs on mobile devices faces performance-resource tradeoffs
Mobile NPUs have underutilized computational resources during LLM inference
NPU limitations require solutions for quantization and complex operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobile NPU parallel test-time scaling for LLMs
Hardware-aware tile quantization for NPU memory access
LUT-based replacements for Softmax and dequantization operations
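To make the LUT idea above concrete, here is a minimal sketch of a Softmax whose transcendental exp() is replaced by a precomputed table lookup. The table range, entry count, and nearest-neighbor indexing are illustrative assumptions; the paper's NPU kernel and table layout are not reproduced here.

```python
import numpy as np

def build_exp_lut(lo=-16.0, hi=0.0, entries=1024):
    """Precompute exp() over [lo, hi]; softmax inputs are shifted so max is 0."""
    grid = np.linspace(lo, hi, entries, dtype=np.float32)
    return grid, np.exp(grid)

def lut_softmax(x, grid, table):
    """Softmax where exp() is replaced by a table lookup.

    Hypothetical sketch: index via binary search on the grid; accuracy is
    bounded by the table's step size (16/1023 here, about 1.6% rel. error).
    """
    shifted = x - x.max()  # numerically safe: all inputs now lie in (-inf, 0]
    idx = np.clip(np.searchsorted(grid, shifted), 0, len(grid) - 1)
    e = table[idx]
    return e / e.sum()
```

On hardware, the lookup replaces a per-element transcendental evaluation with a memory read, which is the source of the 2.2× Softmax speedup the paper reports on the NPU.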
Authors
Zixu Hao, Tsinghua University
Jianyu Wei, USTC & MSRA Joint PhD (LLM Infra, Inference System, Quantization, Kernel, Co-design)
Tuowei Wang, Tsinghua University
Minxing Huang, Tsinghua University
Huiqiang Jiang, Microsoft Research Asia (Efficient AI, LLMs, MLSys)
Shiqi Jiang, Microsoft Research
Ting Cao, Institute for AI Industry Research (AIR), Tsinghua University
Ju Ren, Department of Computer Science and Technology, Tsinghua University (Internet-of-Things, Edge Computing/Intelligence, Security and Privacy)