🤖 AI Summary
Predicting OpenMP workload performance on heterogeneous embedded SoCs is challenging due to strong coupling among task DAG structure, irregular control flow, cache/branch behavior, and thermal dynamics. To address this, we propose the first heterogeneous graph neural network (HGNN) surrogate model that jointly encodes task graph topology, CFG semantics, and real-time hardware state—including DVFS settings, temperature, and core utilization. We unify these three heterogeneous information sources into a typed-edge heterogeneous graph and introduce a multi-task evidential learning head based on the Normal-Inverse-Gamma distribution for calibrated uncertainty quantification and risk-aware prediction. Evaluated on Jetson TX2, Orin NX, and RUBIK Pi platforms, our model achieves R² > 0.95 and expected calibration error (ECE) < 0.05. Integrated into the MAMBRL-D3QN scheduler, it reduces makespan by 66% and energy consumption by 82%, significantly outperforming model-free baselines.
📝 Abstract
Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging due to complex interactions between task DAG structure, control-flow irregularity, cache
and branch behavior, and thermal dynamics; classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks
overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core
DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict
makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts.
We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R² > 0.95 with well-calibrated uncertainty (ECE < 0.05). To
demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL),
multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the
world model achieves a 66% makespan reduction (0.97 ± 0.35 s) and an 82% energy reduction (0.006 ± 0.005 J) compared to model-free baselines, demonstrating that accurate,
uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems.
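The Normal-Inverse-Gamma (NIG) evidential heads mentioned above admit closed-form predictive statistics, which is what makes the risk-aware rollout filtering cheap at scheduling time. The sketch below illustrates this with the standard deep evidential regression formulas (mean γ, aleatoric variance β/(α−1), epistemic variance β/(ν(α−1)), and the Student-t negative log-likelihood); the function names, numeric values, and stdlib-only setup are illustrative assumptions, not the paper's actual multi-task network.

```python
# Minimal sketch of an NIG evidential head's inference-time math, assuming the
# standard deep evidential regression parameterization (gamma, nu, alpha, beta).
# Names and values are hypothetical; the paper's heads are neural network outputs.
import math

def nig_predict(gamma, nu, alpha, beta):
    """Predictive statistics of an NIG(gamma, nu, alpha, beta) posterior.
    Requires alpha > 1 for the variances to be finite."""
    mean = gamma                              # E[mu]: point prediction
    aleatoric = beta / (alpha - 1.0)          # E[sigma^2]: data noise
    epistemic = beta / (nu * (alpha - 1.0))   # Var[mu]: model uncertainty
    return mean, aleatoric, epistemic

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of observation y under the NIG evidence
    (the marginal over y is a Student-t distribution)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

# Risk-aware filtering as described in the abstract: discard planner rollouts
# whose epistemic uncertainty exceeds a tunable confidence threshold.
mean, alea, epi = nig_predict(gamma=1.2, nu=5.0, alpha=3.0, beta=0.8)
trusted = epi < 0.1  # threshold value is an illustrative hyperparameter
```

Separating the aleatoric and epistemic terms is the point of the NIG head: a scheduler can tolerate inherent workload noise (aleatoric) but should reject plans the surrogate has simply not seen enough data to judge (epistemic).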