Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units

📅 2025-06-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low hardware resource utilization in core-interconnect NPUs (e.g., Graphcore IPU) arises from task–hardware topology mismatch. To address this, we propose vNPU—the first topology-aware virtualization framework for interconnect-centric NPU architectures. vNPU integrates three novel techniques: (1) NPU router virtualization, (2) SRAM/NoC co-memory virtualization, and (3) a best-effort topology mapping algorithm—enabling fine-grained virtualization of the underlying hardware interconnect topology. Implemented on FPGA (Chipyard + FireSim) and validated via the DCRA simulation platform, vNPU incorporates instruction/data-flow redirection, low-overhead address translation, and joint performance–resource optimization. Experimental evaluation across diverse ML workloads demonstrates up to 2× speedup over baseline approaches—including unified virtual memory and MIG—while incurring only 2% hardware overhead.

📝 Abstract
With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPUs), has been adopted in both cloud and edge computing environments, such as Graphcore IPU and Tenstorrent. Despite their innovative design, these NPUs often demand substantial hardware resources, leading to suboptimal resource utilization due to the imbalance of hardware requirements across various tasks. To address this issue, prior research has explored virtualization techniques for monolithic NPUs, but has neglected inter-core connected NPUs and their hardware topology. This paper introduces vNPU, the first comprehensive virtualization design for inter-core connected NPUs, integrating three novel techniques: (1) NPU route virtualization, which redirects instruction and data flow from virtual NPU cores to physical ones, creating a virtual topology; (2) NPU memory virtualization, designed to minimize translation stalls for SRAM-centric and NoC-equipped NPU cores, thereby maximizing the memory bandwidth; and (3) best-effort topology mapping, which determines the optimal mapping from all candidate virtual topologies, balancing resource utilization with end-to-end performance. We have developed a prototype of vNPU on both an FPGA platform (Chipyard + FireSim) and a simulator (DCRA). Evaluation results indicate that, compared to other virtualization approaches such as unified virtual memory and MIG, vNPU achieves up to a 2x performance improvement across various ML models, with only 2% hardware cost.
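To make the "minimize translation stalls" idea concrete, the following is a minimal sketch of one plausible low-stall translation scheme for SRAM-centric cores: a single base/bound register pair per virtual core, so translation is one add and one compare rather than a multi-level page walk. The class name and register layout are illustrative assumptions, not vNPU's actual design.

```python
# Hypothetical sketch: contiguous base/bound (segment) translation per
# virtual NPU core. One add + one bounds check per access, no page tables,
# so the NoC/SRAM datapath never stalls on a translation miss.

class SegmentTranslator:
    def __init__(self, base, size):
        self.base = base   # physical SRAM offset of this vNPU's segment
        self.size = size   # segment length in bytes

    def translate(self, vaddr):
        # Bounds check provides isolation between co-located vNPUs.
        if vaddr >= self.size:
            raise MemoryError("virtual address out of segment bounds")
        return self.base + vaddr
```

A segment scheme trades placement flexibility for constant-time translation; a real design might combine a few such segments per core to reduce fragmentation.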
Problem

Research questions and friction points this paper is trying to address.

Optimizing resource utilization in inter-core connected NPUs
Virtualizing hardware topology for AI accelerators efficiently
Balancing performance and hardware costs in NPU virtualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

NPU route virtualization for virtual topology
NPU memory virtualization maximizes bandwidth
Best-effort topology mapping optimizes performance
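As a rough illustration of what a best-effort topology mapping has to optimize, here is a small sketch that places a virtual topology onto free physical mesh cores by minimizing total hop distance between communicating virtual cores. The function names and exhaustive search are assumptions for clarity; the paper's actual algorithm is not specified here, and a real system would use a heuristic for larger meshes.

```python
# Hypothetical sketch: map a virtual NPU topology onto free physical cores
# on a 2D mesh, minimizing total communication hops between virtual cores
# that exchange data. Exhaustive search is only feasible for tiny vNPUs.
from itertools import permutations

def hop_distance(a, b):
    """Manhattan distance between two (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def map_topology(virtual_edges, free_cores, num_vcores):
    """virtual_edges: (vcore_i, vcore_j) pairs that communicate.
    free_cores: (x, y) coordinates of unallocated physical cores.
    Returns the placement tuple and its total hop cost."""
    best, best_cost = None, float("inf")
    for placement in permutations(free_cores, num_vcores):
        cost = sum(hop_distance(placement[i], placement[j])
                   for i, j in virtual_edges)
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost
```

For example, a three-core chain placed on a 2x2 patch of free cores lands on three mutually adjacent tiles, giving a total cost of two hops.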
Dahu Feng
CBICR, Tsinghua University, Beijing, China
Erhu Feng
Shanghai Jiao Tong University
MLSys · Operating System · Architecture
Dong Du
Associate Professor, Nanjing University of Science and Technology
Computer Graphics · 3D Computer Vision
Pinjie Xu
SenseTime Research, Beijing, China
Yubin Xia
Professor, Shanghai Jiao Tong University
Operating System · Virtualization · Computer Architecture · System Security
Haibo Chen
IPADS, Shanghai Jiao Tong University, Shanghai, China
Rong Zhao
Center for Brain-Inspired Computing Research, IDG/McGovern Institute for Brain Research and Department of Precision Instrument, Tsinghua University, Beijing, China