Dynamic Sparse Attention on Mobile SoCs

📅 2025-08-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
In mobile LLM inference, attention operators are often quantization-sensitive and thus offloaded from NPUs to CPUs/GPUs, degrading performance and increasing scheduling complexity. This paper proposes shadowAttn, a dynamic sparse attention mechanism tailored for NPUs. Its core innovation lies in implicitly estimating token importance via NPU-native pointwise computations, coupled with NPU compute-graph bucketing, head-wise NPU–CPU/GPU pipelining, and per-head fine-grained sparsity ratio control—achieving high-accuracy sparse attention with minimal CPU/GPU overhead. Experiments on mainstream mobile NPUs show that shadowAttn matches the inference accuracy and throughput of state-of-the-art frameworks while significantly reducing general-purpose processor resource usage (up to 62% lower CPU/GPU load), thereby enhancing real-time responsiveness and energy efficiency of on-device LLMs.
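The core mechanism the summary describes — cheaply estimating which cached tokens matter, then computing attention over only that top-k subset — can be illustrated with a minimal NumPy sketch of a single decode step. The function name and the plain scaled dot-product "pilot" score are illustrative assumptions, not the paper's actual NPU kernels:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sparse_attention(q, K, V, keep_ratio=0.1):
    """One decode step of top-k sparse attention over a KV cache.

    q: (d,) current query; K, V: (T, d) cached keys/values;
    keep_ratio: fraction of cached tokens actually attended to.
    """
    T, d = K.shape
    k = max(1, int(T * keep_ratio))
    # "Pilot" pass: cheap per-token scores estimate which tokens matter.
    # (Here a plain scaled dot product; the paper hides this estimate
    # inside NPU-native pointwise computation.)
    scores = K @ q / np.sqrt(d)
    idx = np.argpartition(scores, -k)[-k:]   # indices of the top-k tokens
    w = softmax(scores[idx])                 # attend only to those tokens
    return w @ V[idx]
```

With `keep_ratio=1.0` this reduces exactly to dense attention; shrinking the ratio trades a bounded accuracy loss for proportionally less attention compute.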

📝 Abstract
Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that in state-of-the-art frameworks the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention over only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens behind an NPU-based pilot compute. Further, shadowAttn introduces techniques such as NPU compute-graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios to achieve high accuracy and efficiency. shadowAttn delivers the best performance under highly limited CPU/GPU resources, requiring far fewer CPU/GPU resources to match the performance of SoTA frameworks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing attention computation for on-device LLMs
Reducing CPU/GPU dependency in sparse attention modules
Improving efficiency while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse attention on tiny token subsets
NPU-based pilot compute hides overhead
Head-wise pipeline and fine-grained sparsity
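The per-head sparsity idea in the list above can be sketched as follows: each attention head gets its own budget of tokens to keep, since some heads tolerate heavier pruning than others. This is a minimal sequential sketch; the head loop, function name, and budgets are illustrative assumptions (the paper additionally pipelines heads across the NPU and CPU/GPU):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def per_head_sparse_attention(q, K, V, keep_ratios):
    """Decode-step attention where each head gets its own sparsity budget.

    q: (H, d) one query per head; K, V: (H, T, d) per-head KV caches;
    keep_ratios[h]: fraction of tokens head h attends to.
    """
    H, T, d = K.shape
    out = np.empty((H, d))
    for h in range(H):                        # heads are pipelined in the paper
        k = max(1, int(T * keep_ratios[h]))   # per-head fine-grained budget
        scores = K[h] @ q[h] / np.sqrt(d)
        idx = np.argpartition(scores, -k)[-k:]
        out[h] = softmax(scores[idx]) @ V[h, idx]
    return out
```

Because each head's selection and attention are independent, a head whose top-k compute finishes on the NPU can overlap with another head's work elsewhere, which is what makes the head-wise NPU-CPU/GPU pipelining possible.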
Wangsong Yin
Peking University
Daliang Xu
Peking University
mobile computing, system software
Mengwei Xu
State Key Laboratory of Networking and Switching Technology (BUPT), Beijing, China
Gang Huang
Key Lab of High Confidence Software Technologies (Peking University), Beijing, China
Xuanzhe Liu
Boya Distinguished Professor, Peking University, ACM Distinguished Scientist
Machine Learning System, Mobile Computing System, Serverless Computing