Dynamic Sparse Attention on Mobile SoCs

📅 2025-08-22

🤖 AI Summary
In mobile LLM inference, the attention operator is quantization-sensitive, so state-of-the-art frameworks offload it from the NPU to the CPU/GPU, which degrades performance and complicates scheduling. This paper proposes shadowAttn, a dynamic sparse attention mechanism tailored for NPUs. Its core idea is to estimate token importance with an NPU-based pilot compute, which hides the estimation overhead, and to combine this with NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio. Together these yield accurate sparse attention with minimal CPU/GPU overhead. Experiments on mainstream mobile NPUs show that shadowAttn matches the inference accuracy and throughput of state-of-the-art frameworks while significantly reducing general-purpose processor usage (up to 62% lower CPU/GPU load), improving the real-time responsiveness and energy efficiency of on-device LLMs.

📝 Abstract
Running Large Language Models (LLMs) on-device is now a critical enabler for preserving user privacy. We observe that, in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and increases the complexity of system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention on only a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with an NPU-based pilot compute. shadowAttn further proposes techniques such as NPU compute graph bucketing, a head-wise NPU-CPU/GPU pipeline, and a per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resources; it requires far fewer CPU/GPU resources to deliver performance on par with SoTA frameworks.
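
The two-step idea in the abstract (cheaply estimate which tokens matter, then compute exact attention only on those) can be illustrated with a minimal single-head sketch. This is not the paper's implementation: the int8 quantization used as a stand-in for the NPU pilot compute, the top-k selection rule, and the names `quantize_int8`, `pilot_scores`, `sparse_attention`, and `keep_ratio` are all assumptions made for illustration.

```python
import math

def quantize_int8(vec):
    """Symmetric per-vector int8 quantization; returns (ints, scale)."""
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(x / scale) for x in vec], scale

def pilot_scores(query, keys):
    """Low-precision estimate of q.k per token (assumed stand-in for the NPU pilot compute)."""
    q_int, q_scale = quantize_int8(query)
    scores = []
    for key in keys:
        k_int, k_scale = quantize_int8(key)
        dot = sum(a * b for a, b in zip(q_int, k_int))
        scores.append(dot * q_scale * k_scale)
    return scores

def sparse_attention(query, keys, values, keep_ratio):
    """Full-precision softmax attention over only the top-scoring tokens.

    keep_ratio is the fraction of tokens kept (hypothetical parameter name).
    Returns (output_vector, sorted_indices_of_kept_tokens).
    """
    n = len(keys)
    k = max(1, math.ceil(n * keep_ratio))
    estimate = pilot_scores(query, keys)
    kept = sorted(sorted(range(n), key=lambda i: estimate[i], reverse=True)[:k])

    # Exact attention restricted to the kept tokens (the CPU/GPU part in this sketch).
    d = len(query)
    logits = [sum(q * x for q, x in zip(query, keys[i])) / math.sqrt(d) for i in kept]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, i in zip(weights, kept):
        for j, v in enumerate(values[i]):
            out[j] += (w / total) * v
    return out, kept
```

With `keep_ratio=1.0` the function reduces to ordinary dense attention, so the ratio directly trades exact-attention work against how much of the context is seen.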
Problem

Research questions and friction points this paper is trying to address.

Optimizing attention computation for on-device LLMs
Reducing CPU/GPU dependency in sparse attention modules
Improving efficiency while maintaining model accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse attention computed on only a tiny subset of tokens
NPU-based pilot compute hides the overhead of estimating important tokens
Head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio
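
To make the last point concrete, the sketch below shows how a per-head ratio turns into a per-head token budget, and why overlapping one head's estimation with another head's sparse attention shortens the total time. It is a toy timing model under stated assumptions, not the paper's scheduler: the two-stage structure (estimation on the NPU, then attention on the CPU/GPU), the example timings, and the names `tokens_to_keep`, `sequential_makespan`, and `pipelined_makespan` are all hypothetical.

```python
import math

def tokens_to_keep(seq_len, keep_ratios):
    """Per-head token budget from per-head keep ratios (at least one token per head)."""
    return [max(1, math.ceil(seq_len * r)) for r in keep_ratios]

def sequential_makespan(est_times, attn_times):
    """Total time if each head finishes estimation and attention before the next head starts."""
    return sum(est_times) + sum(attn_times)

def pipelined_makespan(est_times, attn_times):
    """Total time for a two-stage pipeline over heads.

    Stage 1 (estimation) runs heads back to back; stage 2 (sparse attention)
    starts a head once its estimate is ready and the previous head's
    attention has finished.
    """
    est_done = 0.0
    attn_done = 0.0
    for est, attn in zip(est_times, attn_times):
        est_done += est
        attn_done = max(est_done, attn_done) + attn
    return attn_done

# Example: three heads with different budgets and made-up stage timings.
budgets = tokens_to_keep(1024, [0.125, 0.25, 0.0625])
sequential = sequential_makespan([2, 2, 2], [1, 3, 1])
pipelined = pipelined_makespan([2, 2, 2], [1, 3, 1])
```

In this model the pipelined total is never worse than the sequential one, and the gap grows when the two stages take comparable time per head.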