🤖 AI Summary
In mobile LLM inference, attention operators are often quantization-sensitive and are therefore offloaded from NPUs to CPUs/GPUs, degrading performance and increasing scheduling complexity. This paper proposes shadowAttn, a dynamic sparse attention mechanism tailored for NPUs. Its core innovation is implicitly estimating token importance via NPU-native pointwise computations, coupled with NPU compute-graph bucketing, head-wise NPU-CPU/GPU pipelining, and per-head fine-grained sparsity ratios, achieving high-accuracy sparse attention with minimal CPU/GPU overhead. Experiments on mainstream mobile NPUs show that shadowAttn matches the inference accuracy and throughput of state-of-the-art frameworks while significantly reducing general-purpose processor usage (up to 62% lower CPU/GPU load), improving the real-time responsiveness and energy efficiency of on-device LLMs.
📝 Abstract
Running Large Language Models (LLMs) on-device is nowadays a critical enabler for preserving user privacy. We observe that in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention only on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens behind an NPU-based pilot compute. Further, shadowAttn introduces techniques such as NPU compute-graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios to achieve high accuracy and efficiency. shadowAttn delivers the best performance under highly limited CPU/GPU resources; it requires much less CPU/GPU resource to deliver performance on par with SoTA frameworks.
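The core mechanism the abstract describes, a cheap pilot pass that scores token importance, followed by exact attention restricted to the top-ranked tokens with a per-head sparsity ratio, can be sketched in a few lines. This is an illustrative NumPy stand-in, not the paper's NPU implementation: the function name `sparse_attention`, the pointwise dot-product importance proxy, and the `keep_ratio` parameter are assumptions made for this sketch.

```python
import numpy as np

def sparse_attention(q, K, V, keep_ratio):
    """Sparse attention for one head and one query.

    A pilot pass scores token importance with cheap pointwise products
    (a stand-in for shadowAttn's NPU-native estimation), then exact
    softmax attention runs only on the top-k selected tokens.
    """
    T, d = K.shape
    k = max(1, int(T * keep_ratio))
    # Pilot pass: pointwise importance estimate for each cached token.
    pilot_scores = (K * q).sum(axis=1)           # shape (T,)
    top = np.argsort(pilot_scores)[-k:]          # indices of important tokens
    # Exact attention restricted to the selected tokens.
    logits = K[top] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
d, T = 64, 256
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Per-head fine-grained sparsity: each head may keep a different fraction.
ratios = [0.05, 0.10, 0.25]
outs = [sparse_attention(q, K, V, r) for r in ratios]
dense = sparse_attention(q, K, V, 1.0)  # keep_ratio=1.0 degenerates to dense
errs = [float(np.linalg.norm(o - dense) / np.linalg.norm(dense)) for o in outs]
```

Because softmax attention is invariant to how the selected tokens are ordered, `keep_ratio=1.0` reproduces dense attention exactly, and `errs` measures how closely each sparsity level approximates it on this random example.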