🤖 AI Summary
This work addresses the memory-wall bottleneck and the lack of native support for speculative decoding on domestic NPU platforms when running large language models. Focusing on the OpenPangu-7B model, it presents the first efficient speculative decoding inference framework implemented on a domestic NPU. By leveraging hardware-specific characteristics of the NPU, the authors design an end-to-end system-level optimization that effectively mitigates memory bandwidth constraints, substantially improving inference throughput and reducing latency. This study fills a critical technical gap in the domestic NPU ecosystem regarding speculative decoding and provides a practical pathway for the efficient deployment of large language models on such hardware.
📝 Abstract
To mitigate the memory-wall bottleneck that Large Language Models (LLMs) encounter during inference on **NPU** hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.