Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization

📅 2025-05-09
🏛️ IEEE Transactions on Parallel and Distributed Systems
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Fine-grained task offloading in heterogeneous multi-core RISC-V MPSoCs incurs high synchronization and communication overhead. This paper presents a cycle-accurate quantitative analysis of that overhead and shows that it worsens nonlinearly as the accelerator core count increases. The authors propose a hardware-software co-designed offloading optimization framework that (i) introduces, for the first time in a NoC with 200+ cores, a multicast-capable hardware mechanism to reduce inter-core communication latency; (ii) provides a lightweight offloading runtime library; and (iii) includes an offloading performance prediction model with less than 15% error. Experimental evaluation shows that the approach restores the speedup of fine-grained parallel tasks to over 70% of the ideal speedup and improves end-to-end performance by up to 2.3×, significantly enhancing energy efficiency in large-scale accelerator clusters.

๐Ÿ“ Abstract
Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose host cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient accelerator cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric implies synchronization and communication overheads which can hamper overall performance and efficiency, particularly for small and fine-grained parallel tasks. In this work, we present a detailed, cycle-accurate quantitative analysis of the offload overheads on Occamy, an open-source massively parallel RISC-V based heterogeneous MPSoC. We study how the overheads scale with the number of accelerator cores. We explore an approach to drastically reduce these overheads by co-designing the hardware and the offload routines. Notably, we demonstrate that by incorporating multicast capabilities into the Network-on-Chip of a large (200+ cores) accelerator fabric we can improve offloaded application runtimes by as much as 2.3x, restoring more than 70% of the ideally attainable speedups. Finally, we propose a quantitative model to estimate the runtime of selected applications accounting for the offload overheads, with an error consistently below 15%.
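To make the scaling argument concrete, the following is a minimal sketch of a toy runtime model. It is not the paper's model: it assumes the offload overhead splits into a fixed part plus a per-core wakeup cost, paid once per core with unicast notifications and only once with multicast. The function names, parameters, and all cycle counts are hypothetical and chosen purely for illustration.

```python
def offload_runtime(work_cycles, n_cores, wakeup_cycles, fixed_cycles, multicast=False):
    """Toy estimate (in cycles) of one offloaded job's runtime.

    Assumption (not from the paper): the host notifies accelerator cores
    one by one unless multicast is available, in which case a single
    notification reaches every core.
    """
    parallel = work_cycles / n_cores            # perfectly parallel compute part
    notifications = 1 if multicast else n_cores
    return parallel + fixed_cycles + wakeup_cycles * notifications


def fraction_of_ideal(work_cycles, n_cores, runtime):
    """Share of the ideal (overhead-free) speedup that a given runtime attains."""
    ideal_runtime = work_cycles / n_cores
    return ideal_runtime / runtime


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from Occamy.
    work, cores, wakeup, fixed = 100_000, 200, 10, 100
    unicast = offload_runtime(work, cores, wakeup, fixed)
    mcast = offload_runtime(work, cores, wakeup, fixed, multicast=True)
    print("unicast runtime:", unicast)
    print("multicast runtime:", mcast)
    print("fraction of ideal (unicast):", fraction_of_ideal(work, cores, unicast))
    print("fraction of ideal (multicast):", fraction_of_ideal(work, cores, mcast))
```

In this toy model the compute term shrinks with the core count while the unicast overhead grows with it, so small jobs are dominated by overhead on large fabrics. That is the qualitative effect the abstract describes for fine-grained tasks.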
Problem

Research questions and friction points this paper is trying to address.

Analyzing offload overheads in RISC-V MPSoC
Reducing overheads via hardware-software co-design
Modeling runtime with offload overheads accurately
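As an illustration of what a "below 15% error" claim means for a runtime model, the sketch below computes the relative error between predicted and measured runtimes and checks it against a bound. The helper names and the sample values are hypothetical, and the paper may define its error metric differently.

```python
def relative_error(predicted_cycles, measured_cycles):
    """Relative error of a predicted runtime against a measured one."""
    if measured_cycles <= 0:
        raise ValueError("measured_cycles must be positive")
    return abs(predicted_cycles - measured_cycles) / measured_cycles


def within_bound(pairs, bound=0.15):
    """True if every (predicted, measured) pair has relative error below bound."""
    return all(relative_error(p, m) < bound for p, m in pairs)


if __name__ == "__main__":
    # Hypothetical (predicted, measured) runtimes in cycles.
    samples = [(1100, 1000), (4800, 5000), (930, 1000)]
    for predicted, measured in samples:
        print(predicted, measured, relative_error(predicted, measured))
    print("all below 15%:", within_bound(samples))
```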
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cycle-accurate analysis of offload overheads
Hardware-software co-design reduces overheads
Multicast NoC improves runtime by 2.3x