🤖 AI Summary
This study addresses immature compiler support for automatic vectorization on real hardware implementing the RISC-V Vector Extension (RVV 1.0), which limits RVV performance in scientific computing and machine learning. We present the first systematic evaluation of automatic vectorization in GCC 15 and LLVM 21 on RVV hardware, combining assembly-level microbenchmarks, perf counter calibration, and a comparison of manual and compiler-generated vectorization in the Qsim quantum simulator. Our analysis shows that key performance bottlenecks, such as predication overhead and strided memory accesses, are not yet fully captured by current compiler cost models, while default LMUL selection is already near-optimal. GCC 15 outperforms LLVM 21 in four of six proxy applications; LLVM's advantage in SGEMM and DGEMM stems from more aggressive instruction reduction. The Qsim experiments expose immature handling of complex memory access patterns in current RVV compilers.
📝 Abstract
The RISC-V Vector Extension~(RVV) is a cornerstone for supporting compute throughput in scientific and machine learning workloads. Yet compiler support and performance monitoring on real RVV~1.0 hardware are still evolving. In this work, we design a suite of assembly microbenchmarks to establish performance ceilings and calibrate performance counters on RVV hardware. Leveraging these assembly benchmarks, we find that predication overhead and strided loads pose performance challenges that current compiler cost models do not yet fully address. Moreover, we present the first evaluation of GCC~15 and LLVM~21 autovectorization on HPC and ML proxy applications. GCC~15 outperforms LLVM~21 in four out of six applications. LLVM~21 outperforms GCC~15 only in SGEMM and DGEMM, driven by more aggressive instruction reduction, which we confirm through validated \texttt{perf} counters on the RVV hardware. We further show that the default LMUL selection in compilers performs close to optimal. To study RVV support for production-level applications, we also evaluate Google's state-vector quantum simulator, Qsim, with both manual RVV intrinsics and compiler auto-vectorization, revealing immaturity in current RVV compilers for complex memory access patterns.
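As an illustration of the two access patterns the abstract singles out, the following is a minimal C sketch, not code from the paper. The function names `strided_scale`, `masked_scale`, and `check_kernels` are hypothetical, and whether a given compiler vectorizes these loops, and with which instructions, depends on its version and cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Strided load pattern (hypothetical example kernel): reads every
 * `stride`-th element of src. On RVV, a vectorizer may lower this to a
 * strided load such as vlse32.v instead of a unit-stride vle32.v. */
void strided_scale(float *dst, const float *src, size_t n,
                   size_t stride, float factor) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride] * factor;
}

/* Predication pattern (hypothetical example kernel): the store happens
 * only where the condition holds, so a vectorized version needs a mask
 * (predicate) to leave the other elements of y untouched. */
void masked_scale(float *y, const float *x, size_t n,
                  float threshold, float factor) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] > threshold)
            y[i] = x[i] * factor;
    }
}

/* Small self-check of the scalar semantics; returns 1 on success. */
int check_kernels(void) {
    float src[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float dst[4] = {0, 0, 0, 0};
    strided_scale(dst, src, 4, 2, 2.0f);   /* reads src[0], [2], [4], [6] */
    assert(dst[1] == 4.0f && dst[3] == 12.0f);

    float x[4] = {1, 5, 2, 8};
    float y[4] = {-1, -1, -1, -1};
    masked_scale(y, x, 4, 3.0f, 0.5f);     /* updates only x[i] > 3 */
    assert(y[0] == -1.0f && y[1] == 2.5f && y[3] == 4.0f);
    return 1;
}
```

Compiling such kernels with `-O3 -march=rv64gcv` under GCC or Clang and inspecting the generated assembly is one way to see whether strided or masked vector instructions were emitted. The exact output will vary by compiler version.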