Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization

📅 2025-05-09
🏛️ IEEE Transactions on Parallel and Distributed Systems
🤖 AI Summary
Fine-grained task offloading in heterogeneous RISC-V MPSoCs incurs synchronization and communication overheads that can erode the benefit of acceleration. This paper presents a cycle-accurate quantitative analysis of these overheads on Occamy, an open-source massively parallel RISC-V MPSoC, and studies how they scale with the number of accelerator cores. It then explores a hardware-software co-design approach with three parts: (i) adding multicast capabilities to the Network-on-Chip of a large (200+ cores) accelerator fabric; (ii) co-designing the offload routines with the hardware; and (iii) a quantitative model that estimates the runtime of selected applications, offload overheads included, with an error consistently below 15%. With multicast, offloaded application runtimes improve by up to 2.3x, restoring more than 70% of the ideally attainable speedups.

📝 Abstract
Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose host cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient accelerator cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric implies synchronization and communication overheads which can hamper overall performance and efficiency, particularly for small and fine-grained parallel tasks. In this work, we present a detailed, cycle-accurate quantitative analysis of the offload overheads on Occamy, an open-source massively parallel RISC-V based heterogeneous MPSoC. We study how the overheads scale with the number of accelerator cores. We explore an approach to drastically reduce these overheads by co-designing the hardware and the offload routines. Notably, we demonstrate that by incorporating multicast capabilities into the Network-on-Chip of a large (200+ cores) accelerator fabric we can improve offloaded application runtimes by as much as 2.3x, restoring more than 70% of the ideally attainable speedups. Finally, we propose a quantitative model to estimate the runtime of selected applications accounting for the offload overheads, with an error consistently below 15%.
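To make concrete why offload overheads matter most for small tasks, the following is a minimal sketch of a toy runtime estimate. It is not the model proposed in the paper, whose form is not given here. It assumes, purely for illustration, that the overhead is a fixed cost plus a cost per accelerator core (as if the host signalled each core one after another), and that a multicast-style offload removes the per-core term. All names and numbers are hypothetical.

```python
def offload_runtime(work_cycles, n_cores, fixed_overhead, per_core_overhead):
    """Toy estimate of offloaded runtime in cycles (illustrative assumption).

    Overhead grows linearly with the number of cores; the compute part is
    assumed to parallelize perfectly across n_cores.
    """
    overhead = fixed_overhead + per_core_overhead * n_cores
    return overhead + work_cycles / n_cores


def fraction_of_ideal(work_cycles, n_cores, fixed_overhead, per_core_overhead):
    """Achieved speedup divided by the ideal speedup (which equals n_cores)."""
    ideal_runtime = work_cycles / n_cores
    actual_runtime = offload_runtime(
        work_cycles, n_cores, fixed_overhead, per_core_overhead
    )
    return ideal_runtime / actual_runtime


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the paper.
    work, cores = 100_000, 256
    sequential = fraction_of_ideal(work, cores, fixed_overhead=100, per_core_overhead=2)
    multicast = fraction_of_ideal(work, cores, fixed_overhead=100, per_core_overhead=0)
    print(f"per-core signalling: {sequential:.2f} of ideal speedup")
    print(f"multicast-style:     {multicast:.2f} of ideal speedup")
```

In this sketch the per-core term dominates once the per-core share of work becomes small, which is the regime the abstract describes as small and fine-grained parallel tasks.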
Problem

Research questions and friction points this paper is trying to address.

Analyzing offload overheads in RISC-V MPSoC
Reducing overheads via hardware-software co-design
Modeling runtime with offload overheads accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cycle-accurate analysis of offload overheads
Hardware-software co-design reduces overheads
Multicast-capable NoC improves offloaded application runtimes by up to 2.3x