AI Summary
In the high-order/spectral finite element method (HOSFEM), the performance of the matrix-free matrix-vector product is limited by the geometric factors: they account for over half of the memory access but contribute little computation. Under the conventional precomputation strategy, this caps the roofline and leaves little to gain from optimizing the tensor contraction. This work instead recomputes the geometric factors on the fly for trilinear elements at low cost, combining merged scalar factors, partial recomputation, and constant memory. Removing the precomputed data raises the performance roofline. With Tensor Core acceleration added and the kernels integrated into the Nekbone benchmark, the core kernels run 1.74x-4.10x faster on NVIDIA A100 and 1.99x-3.77x faster on DCU K100, Nekbone as a whole runs 1.12x-1.40x faster, and performance reaches 85%-100% of the new roofline.
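To make the memory imbalance concrete, the following is a rough counting sketch rather than the paper's measured figures. It assumes a Poisson-type operator with six symmetric geometric factors stored per node, one input word read and one output word written per node, and (degree + 1)^3 nodes per hexahedral element. The function names and default values are hypothetical.

```python
def geometric_traffic_fraction(n_factors=6, n_fields=2):
    """Share of per-node words that are geometric factors.

    Assumption: n_factors words of geometry plus n_fields words of
    solution data (one read, one write) are moved per node.
    """
    return n_factors / (n_factors + n_fields)


def per_element_words(degree, n_factors=6):
    """Geometry words per element: precomputed factors vs. raw vertices."""
    nodes = (degree + 1) ** 3
    precomputed = n_factors * nodes   # one set of factors per node
    recomputed = 8 * 3                # 8 vertices x 3 coordinates
    return precomputed, recomputed
```

Under these assumptions the geometric factors make up three quarters of the per-node traffic, while a trilinear element is fully described by the 24 coordinates of its eight vertices regardless of polynomial degree.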
Abstract
The high-order/spectral finite element method (HOSFEM) is a widely used numerical method for solving PDEs, with its performance primarily relying on axhelm, a matrix-free kernel for element-local matrix-vector multiplications. In axhelm, geometric factors account for over half of memory access but minimally contribute to computational workload. This imbalance significantly constrains the performance roofline, indicating that further optimization of tensor contraction, the core computation in axhelm, yields only minimal improvements. To overcome this bottleneck, we propose a low-cost on-the-fly recalculation of geometric factors for trilinear elements, thereby unlocking substantial potential for optimizing tensor contraction. The proposed approach is implemented in Nekbone, a standard HOSFEM benchmark. With optimizations such as merging scalar factors, partial recalculation, Tensor Core acceleration, and constant memory utilization, performance reaches 85%-100% of the higher roofline. The optimized kernels achieve speedups of 1.74x-4.10x on NVIDIA A100 and 1.99x-3.77x on DCU K100. This leads to a 1.12x-1.40x speedup for Nekbone.
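The idea of recomputing geometric factors for a trilinear element can be sketched as follows. This is a minimal illustration, not the kernel from the paper: it omits the merged scalar factors, partial recalculation, Tensor Core, and constant memory optimizations. It assumes the standard Poisson form G = w det(J) J^{-1} J^{-T}, where J is the Jacobian of the trilinear map from the reference cube to the element and w is a quadrature weight. The names `CORNERS`, `jacobian_columns`, and `geometric_factors`, and the vertex ordering, are hypothetical.

```python
from itertools import product

# Signs (a, b, c) of the eight reference-cube corners; vertex k of an element
# is assumed to be stored in this same order (a hypothetical convention).
CORNERS = list(product((-1.0, 1.0), repeat=3))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def jacobian_columns(verts, r, s, t):
    """Columns dx/dr, dx/ds, dx/dt of the trilinear map at (r, s, t)."""
    cols = [[0.0, 0.0, 0.0] for _ in range(3)]
    for (a, b, c), x in zip(CORNERS, verts):
        # Gradient of the shape function (1 + a r)(1 + b s)(1 + c t) / 8.
        dN = (a * (1 + b * s) * (1 + c * t) / 8.0,
              (1 + a * r) * b * (1 + c * t) / 8.0,
              (1 + a * r) * (1 + b * s) * c / 8.0)
        for j in range(3):
            for i in range(3):
                cols[j][i] += x[i] * dN[j]
    return cols


def geometric_factors(verts, r, s, t, w=1.0):
    """Six unique entries of G = w * det(J) * J^-1 * J^-T at one node."""
    c0, c1, c2 = jacobian_columns(verts, r, s, t)
    # Rows of adj(J) = det(J) * J^-1 are cross products of the columns of J.
    rows = (cross(c1, c2), cross(c2, c0), cross(c0, c1))
    det = dot(c0, rows[0])
    scale = w / det
    # Order: G00, G01, G02, G11, G12, G22.
    return tuple(scale * dot(rows[i], rows[j])
                 for i in range(3) for j in range(i, 3))
```

Only the eight vertex coordinates are read per element; every node's factors are rebuilt from them with a handful of multiply-adds. For an axis-aligned box with half-widths 1, 2, and 0.5, the Jacobian is diagonal and the factors reduce to (1, 0, 0, 0.25, 0, 4) at every node.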