🤖 AI Summary
To address the limited performance and low hardware utilization of 3D high-order stencil computations on multicore CPUs, this paper proposes a co-optimization methodology targeting matrix units (MUs). The approach introduces a novel multithreaded parallel paradigm that synergistically integrates SIMD and matrix instructions for computational acceleration; designs a DMA-driven inter-NUMA communication mechanism to alleviate data-sharing bottlenecks in non-uniform memory architectures; and jointly optimizes memory layout and access locality to improve bandwidth utilization. Evaluated against state-of-the-art libraries running on an NVIDIA A100 GPGPU, the resulting system (MMStencil) achieves up to a 2.1× speedup, and it delivers a 1.8× improvement over a highly optimized industrial A100 implementation on reverse time migration (RTM) workloads. These results significantly broaden the applicability and scalability of matrix units for complex, high-order stencil computations.
📝 Abstract
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, MMStencil outperforms state-of-the-art libraries on an NVIDIA A100 GPGPU by up to 2.1×. Moreover, the performance improvements translate directly to real-world HPC applications, enabling RTM applications to achieve a 1.8× speedup versus a highly optimized industrial NVIDIA A100 GPGPU version.
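To make the core idea concrete, here is a minimal NumPy sketch (not the paper's MMStencil algorithm) of how a high-order stencil can be recast as a small dense matrix multiply, which is the general trick that lets matrix units accelerate stencil computation. The stencil order, coefficients, and function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: an 8th-order (9-point) 1D stencil applied along
# the contiguous axis of a 2D block, expressed two ways:
#   1) the usual shifted-add loop, and
#   2) a single dense matmul u @ B, where B is a banded coefficient
#      matrix -- the form a matrix unit (MU) can consume directly.
ORDER = 8
coeffs = np.array([-1/560, 8/315, -1/5, 8/5, -205/72,
                   8/5, -1/5, 8/315, -1/560])  # standard central-difference weights

def stencil_direct(u):
    """Reference: apply the 9-point stencil pointwise along axis 1."""
    n = u.shape[1] - ORDER
    out = np.zeros((u.shape[0], n))
    for k, c in enumerate(coeffs):
        out += c * u[:, k:k + n]
    return out

def stencil_matmul(u):
    """Same stencil as one matmul: B[j+k, j] = coeffs[k], so
    (u @ B)[i, j] = sum_k coeffs[k] * u[i, j+k]."""
    n = u.shape[1] - ORDER
    B = np.zeros((u.shape[1], n))
    for k, c in enumerate(coeffs):
        B[np.arange(n) + k, np.arange(n)] = c
    return u @ B

u = np.random.rand(4, 32)
assert np.allclose(stencil_direct(u), stencil_matmul(u))
```

The matmul form trades extra FLOPs (B is mostly zeros) for the much higher throughput of matrix hardware; the paper's contribution lies in making this trade pay off for 3D high-order stencils, where strided accesses and redundant loads otherwise erode the gain.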