Accelerating OpenPangu Inference on NPU via Speculative Decoding

📅 2026-03-03
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the memory-wall bottleneck and the lack of native support for speculative decoding on domestic NPU platforms when running large language models. Focusing on the OpenPangu-7B model, it presents the first efficient speculative decoding inference framework implemented on a domestic NPU. By leveraging hardware-specific characteristics of the NPU, the authors design an end-to-end system-level optimization that effectively mitigates memory bandwidth constraints, substantially improving inference throughput and reducing latency. This study fills a critical technical gap in the domestic NPU ecosystem regarding speculative decoding and provides a practical pathway for the efficient deployment of large language models on such hardware.
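
For context, the core mechanism behind such a scheme is the standard speculative decoding loop: a small draft model proposes a short block of tokens autoregressively, and the large target model verifies the whole block in a single forward pass, so each memory-bound pass over the target weights can yield several accepted tokens. The sketch below illustrates this general loop (a greedy-acceptance variant) in PyTorch with Hugging-Face-style model outputs; the function, models, and parameters are illustrative placeholders, not the paper's NPU-specific implementation.

```python
import torch

@torch.no_grad()
def speculative_decode(draft_model, target_model, input_ids, gamma=4, max_new_tokens=64):
    """Minimal sketch of greedy speculative decoding.

    Assumes both models are causal LMs whose forward pass returns an object
    with `.logits` of shape (1, seq_len, vocab), as in Hugging Face Transformers.
    """
    tokens = input_ids.clone()
    while tokens.shape[1] - input_ids.shape[1] < max_new_tokens:
        # 1. Draft phase: the small model proposes `gamma` tokens autoregressively.
        draft = tokens
        for _ in range(gamma):
            logits = draft_model(draft).logits[:, -1, :]
            draft = torch.cat([draft, logits.argmax(dim=-1, keepdim=True)], dim=1)

        # 2. Verify phase: one forward pass of the large model scores all drafted
        #    positions at once, amortizing the memory-bandwidth cost of its weights.
        target_pred = target_model(draft).logits.argmax(dim=-1)  # (1, len(draft))

        # 3. Accept the longest prefix of drafted tokens the target model agrees with.
        base = tokens.shape[1]
        n_accept = 0
        for i in range(gamma):
            if draft[0, base + i] == target_pred[0, base + i - 1]:
                n_accept += 1
            else:
                break

        # 4. Keep the accepted prefix plus one correction token from the target model.
        tokens = torch.cat(
            [draft[:, : base + n_accept],
             target_pred[:, base + n_accept - 1 : base + n_accept]],
            dim=1,
        )
    return tokens
```

In this deterministic form the accepted tokens always coincide with the target model's own greedy choices, so output quality matches plain greedy decoding of the target model; the published sampling-based variants instead accept or reject drafted tokens by comparing draft and target probabilities.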

📝 Abstract
To mitigate the Memory Wall bottleneck encountered by Large Language Models (LLMs) during inference on NPU hardware, and to address the scarcity of native support for mainstream speculative decoding algorithms on domestic infrastructure, this study presents an end-to-end speculative inference acceleration scheme for OpenPangu-7B.
Problem

Research questions and friction points this paper is trying to address.

Memory Wall
Large Language Models
NPU
speculative decoding
inference acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
NPU
Memory Wall
LLM Inference
OpenPangu
🔎 Similar Papers
No similar papers found.
Yuntao Dai
School of Software Engineering, University of Science and Technology of China
Jing Wu
School of Computer Science & Informatics, Cardiff University
3D reconstruction, pattern recognition, visual analytics
Hang Gu
School of Software Engineering, University of Science and Technology of China
Teng Wang
University of Science and Technology of China
Accelerator, FPGA, Architecture