🤖 AI Summary
This work addresses the memory-wall bottleneck and the lack of native support for speculative decoding on domestic NPU platforms when running large language models. Focusing on the OpenPangu-7B model, it presents the first efficient speculative decoding inference framework implemented on a domestic NPU. By leveraging hardware-specific characteristics of the NPU, the authors design an end-to-end system-level optimization that effectively mitigates memory bandwidth constraints, substantially improving inference throughput and reducing latency. This study fills a critical technical gap in the domestic NPU ecosystem regarding speculative decoding and provides a practical pathway for the efficient deployment of large language models on such hardware.
📝 Abstract
To mitigate the memory-wall bottleneck that Large Language Models (LLMs) encounter during inference on **NPU** hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.