AI Summary
This work addresses the high computational cost of Gaussian processes (GPs), which limits how efficiently they scale across heterogeneous hardware. It extends GPRat, a GP library built on the asynchronous many-task runtime system HPX, to run portably on x86-64, ARM, and RISC-V architectures, and provides a systematic performance evaluation. Node-level strong-scaling and problem-size-scaling experiments show that the 48-core ARM A64FX outperforms the 64-core x86-64 Zen 2 by 9% at full node utilization, despite Zen 2's 58% single-core advantage. The RISC-V SG2042 is up to 14× slower on a single core and up to 25× slower in large-scale parallel workloads, which points to the need for wide-register vectorization support and memory subsystem improvements. The study provides a cross-architecture baseline and identifies optimization directions for deploying Gaussian processes on these platforms.
Abstract
Gaussian processes are widely used in machine learning but remain computationally demanding, which limits how efficiently they scale across diverse hardware platforms. The GPRat library addresses these challenges using the asynchronous many-task runtime system HPX. In this work, we extend GPRat to enable portability across multiple hardware architectures and evaluate its performance on representative x86-64, ARM, and RISC-V chips. We conduct node-level strong-scaling and problem-size-scaling benchmarks for Gaussian process prediction and hyperparameter optimization to assess single-core performance, parallel scalability, and architectural efficiency.
Our results show that while the x86-64 Zen 2 chip achieves a 58% single-core performance advantage over the ARM-based Fujitsu A64FX, superior parallel scaling allows the 48-core ARM chip to outperform the 64-core Zen 2 by 9% at full node utilization. The evaluated SOPHON SG2042 RISC-V chip exhibits substantially lower performance and weaker scalability, with single-core performance lagging by up to a factor of 14 and large-scale parallel workloads showing slowdowns of up to a factor of 25. For problem-size scaling, ARM and x86-64 systems demonstrate comparable performance within 25%. These findings highlight the growing competitiveness of ARM-based processors and emphasize the importance of wide-register vectorization support and memory subsystem improvements for upcoming RISC-V platforms.
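The single-core and full-node figures above can be combined into a rough estimate of how much better the A64FX scales than the Zen 2. The following is a minimal sketch, not part of GPRat: it assumes that both percentages are runtime ratios (the ARM chip takes 1.58 times as long on one core, and the Zen 2 takes 1.09 times as long at full node utilization), and the function names are hypothetical.

```python
def speedup_ratio(single_core_advantage: float, full_node_advantage: float) -> float:
    """Ratio of strong-scaling speedups (chip B over chip A).

    Assumption: chip A is faster on one core by `single_core_advantage`
    (e.g. 0.58), and chip B is faster at full node by `full_node_advantage`
    (e.g. 0.09), both expressed as runtime ratios minus one.
    With speedup S = T(1 core) / T(all cores), the ratio S_B / S_A
    reduces to (1 + single_core_advantage) * (1 + full_node_advantage).
    """
    return (1.0 + single_core_advantage) * (1.0 + full_node_advantage)


def efficiency_ratio(ratio_of_speedups: float, cores_a: int, cores_b: int) -> float:
    """Ratio of parallel efficiencies (speedup per core), chip B over chip A."""
    return ratio_of_speedups * cores_a / cores_b


# Figures from the abstract: Zen 2 (64 cores) is chip A, A64FX (48 cores) is chip B.
s_ratio = speedup_ratio(0.58, 0.09)
e_ratio = efficiency_ratio(s_ratio, cores_a=64, cores_b=48)

print(f"A64FX speedup relative to Zen 2:    {s_ratio:.2f}x")  # 1.72x
print(f"A64FX efficiency relative to Zen 2: {e_ratio:.2f}x")  # 2.30x
```

Under these assumptions, the A64FX reaches about 1.7 times the strong-scaling speedup of the Zen 2 while using fewer cores. If the reported percentages are defined on throughput rather than runtime, the estimate changes accordingly.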