🤖 AI Summary
In mobile LLM inference, attention operators are often quantization-sensitive and are therefore offloaded from NPUs to CPUs/GPUs, degrading performance and increasing scheduling complexity. This paper proposes shadowAttn, a dynamic sparse attention mechanism tailored for NPUs. Its core innovation is implicitly estimating token importance via NPU-native pointwise computations, coupled with NPU compute-graph bucketing, head-wise NPU-CPU/GPU pipelining, and per-head fine-grained sparsity ratios, achieving high-accuracy sparse attention with minimal CPU/GPU overhead. Experiments on mainstream mobile NPUs show that shadowAttn matches the inference accuracy and throughput of state-of-the-art frameworks while significantly reducing general-purpose processor usage (up to 62% lower CPU/GPU load), improving the real-time responsiveness and energy efficiency of on-device LLMs.
📝 Abstract
Running Large Language Models (LLMs) on-device is nowadays a critical enabler for preserving user privacy. We observe that in state-of-the-art frameworks, the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of its quantization sensitivity. This fallback degrades the user experience and complicates system scheduling. To this end, this paper presents shadowAttn, a system-algorithm co-designed sparse attention module that minimizes reliance on the CPU/GPU by computing attention only on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens behind an NPU-based pilot compute. Further, shadowAttn introduces techniques such as NPU compute-graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios to achieve high accuracy and efficiency. shadowAttn delivers the best performance under highly limited CPU/GPU resources; it requires much less CPU/GPU resource to deliver performance on par with SoTA frameworks.
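The core mechanism the abstract describes, a cheap pilot pass that scores token importance, followed by exact attention restricted to the top-ranked tokens with a per-head sparsity ratio, can be sketched in a few lines. This is an illustrative NumPy stand-in, not the paper's NPU implementation: the function name `sparse_attention`, the pointwise dot-product importance proxy, and the `keep_ratio` parameter are assumptions made for this sketch.

```python
import numpy as np

def sparse_attention(q, K, V, keep_ratio):
    """Sparse attention for one head and one query.

    A pilot pass scores token importance with cheap pointwise products
    (a stand-in for shadowAttn's NPU-native estimation), then exact
    softmax attention runs only on the top-k selected tokens.
    """
    T, d = K.shape
    k = max(1, int(T * keep_ratio))
    # Pilot pass: pointwise importance estimate for each cached token.
    pilot_scores = (K * q).sum(axis=1)           # shape (T,)
    top = np.argsort(pilot_scores)[-k:]          # indices of important tokens
    # Exact attention restricted to the selected tokens.
    logits = K[top] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[top]

rng = np.random.default_rng(0)
d, T = 64, 256
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Per-head fine-grained sparsity: each head may keep a different fraction.
ratios = [0.05, 0.10, 0.25]
outs = [sparse_attention(q, K, V, r) for r in ratios]
dense = sparse_attention(q, K, V, 1.0)  # keep_ratio=1.0 degenerates to dense
errs = [float(np.linalg.norm(o - dense) / np.linalg.norm(dense)) for o in outs]
```

Because softmax attention is invariant to how the selected tokens are ordered, `keep_ratio=1.0` reproduces dense attention exactly, and `errs` measures how closely each sparsity level approximates it on this random example.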