🤖 AI Summary
To address the limited performance and low hardware utilization of 3D high-order stencil computations on multicore CPUs, this paper proposes a co-optimization methodology targeting matrix units (MUs). The approach introduces a novel multithreaded parallel paradigm that synergistically integrates SIMD and matrix instructions for computational acceleration; designs a DMA-driven inter-NUMA communication mechanism to alleviate data-sharing bottlenecks in non-uniform memory architectures; and jointly optimizes memory layout and access locality to improve bandwidth utilization. Evaluated against state-of-the-art libraries running on an NVIDIA A100 GPGPU, the resulting system (MMStencil) achieves up to a 2.1× speedup, and it delivers a 1.8× improvement over a highly optimized industrial A100 implementation on reverse time migration (RTM) workloads. These results significantly broaden the applicability and scalability of matrix units for complex, high-order stencil computations.
📝 Abstract
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, MMStencil outperforms state-of-the-art libraries on an NVIDIA A100 GPGPU by up to 2.1×. Moreover, the performance improvements translate directly to real-world HPC applications, enabling RTM applications to achieve a 1.8× speedup versus a highly optimized industrial NVIDIA A100 GPGPU version.
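To make the core idea concrete, here is a minimal NumPy sketch (not the paper's MMStencil algorithm) of how a high-order stencil can be recast as a small dense matrix multiply, which is the general trick that lets matrix units accelerate stencil computation. The stencil order, coefficients, and function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch: an 8th-order (9-point) 1D stencil applied along
# the contiguous axis of a 2D block, expressed two ways:
#   1) the usual shifted-add loop, and
#   2) a single dense matmul u @ B, where B is a banded coefficient
#      matrix -- the form a matrix unit (MU) can consume directly.
ORDER = 8
coeffs = np.array([-1/560, 8/315, -1/5, 8/5, -205/72,
                   8/5, -1/5, 8/315, -1/560])  # standard central-difference weights

def stencil_direct(u):
    """Reference: apply the 9-point stencil pointwise along axis 1."""
    n = u.shape[1] - ORDER
    out = np.zeros((u.shape[0], n))
    for k, c in enumerate(coeffs):
        out += c * u[:, k:k + n]
    return out

def stencil_matmul(u):
    """Same stencil as one matmul: B[j+k, j] = coeffs[k], so
    (u @ B)[i, j] = sum_k coeffs[k] * u[i, j+k]."""
    n = u.shape[1] - ORDER
    B = np.zeros((u.shape[1], n))
    for k, c in enumerate(coeffs):
        B[np.arange(n) + k, np.arange(n)] = c
    return u @ B

u = np.random.rand(4, 32)
assert np.allclose(stencil_direct(u), stencil_matmul(u))
```

The matmul form trades extra FLOPs (B is mostly zeros) for the much higher throughput of matrix hardware; the paper's contribution lies in making this trade pay off for 3D high-order stencils, where strided accesses and redundant loads otherwise erode the gain.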