MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit

📅 2025-07-15
🤖 AI Summary
To address the limited performance and low hardware utilization of 3D high-order stencil computations on multicore CPUs, this paper proposes MMStencil, a co-optimization methodology targeting CPU matrix units (MUs). The approach combines SIMD and matrix instructions for computational acceleration; introduces a multithreaded parallel paradigm that works around the absence of shared data caches; adds DMA-driven inter-NUMA communication to mitigate NUMA effects and MPI limitations in hybrid parallelism; and optimizes memory layout and access patterns to improve on-package memory efficiency. Running on a multicore CPU, MMStencil outperforms state-of-the-art libraries on the NVIDIA A100 GPU by up to 2.1×, and a reverse time migration (RTM) application achieves a 1.8× speedup over a highly optimized industrial A100 implementation. These results broaden the applicability of matrix units to complex, high-order stencil computations.

📝 Abstract
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, MMStencil outperforms state-of-the-art libraries on Nvidia A100 GPGPU by up to 2.1x. Moreover, the performance improvements translate directly to real-world HPC applications and enable RTM applications to yield 1.8x speedup versus a highly optimized industrial Nvidia A100 GPGPU version.
Problem

Research questions and friction points this paper is trying to address.

Optimizing 3D high-order stencils on multicore CPUs using matrix units
Addressing memory access and alignment issues in stencil computations
Enhancing NUMA and MPI efficiency for hybrid parallel HPC applications
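
To make the memory access issue concrete, the sketch below evaluates one point of a 3D star stencil on a flattened grid. It is an illustration only, not code from the paper: the function name `star_stencil_point`, the grid sizes, and the radius-2 coefficients are assumptions chosen for the example. It shows that neighbors along one axis are contiguous while neighbors along the other two sit `nz` and `ny * nz` elements apart, which is the kind of strided access the paper sets out to handle.

```python
# Illustrative sketch: one point of a 3D star stencil on a flattened grid.
# Grid sizes, radius, and coefficient values are assumptions for this example.

def star_stencil_point(grid, ny, nz, x, y, z, coeffs):
    """Return the star stencil at (x, y, z); `grid` is a flat list with z fastest.

    coeffs[0] is the per-axis center weight, coeffs[d] the weight at distance d.
    """
    r = len(coeffs) - 1
    stride_y = nz        # one step in y skips a full row of z values
    stride_x = ny * nz   # one step in x skips a full y-z plane
    center = x * stride_x + y * stride_y + z
    acc = 3 * coeffs[0] * grid[center]  # center point counted once per axis
    for d in range(1, r + 1):
        acc += coeffs[d] * (
            grid[center - d] + grid[center + d]                          # unit stride (z)
            + grid[center - d * stride_y] + grid[center + d * stride_y]  # stride nz (y)
            + grid[center - d * stride_x] + grid[center + d * stride_x]  # stride ny*nz (x)
        )
    return acc


# Sample data: f(x, y, z) = x^2 + y^2 + z^2 on a 6 x 6 x 6 grid.
nx = ny = nz = 6
grid = [float(x * x + y * y + z * z)
        for x in range(nx) for y in range(ny) for z in range(nz)]

# Radius-2 (fourth-order) second-derivative weights: center, distance 1, distance 2.
coeffs = [-5 / 2, 4 / 3, -1 / 12]
laplacian = star_stencil_point(grid, ny, nz, 2, 3, 2, coeffs)
```

With these sample weights the stencil approximates a Laplacian, so the quadratic test field should give a value close to 6 at any interior point.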
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matrix-based acceleration for 3D high-order stencils
SIMD and matrix unit algorithmic optimizations
DMA-based inter-NUMA communication for hybrid parallelism
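
As a rough illustration of what matrix-based stencil acceleration means, a stencil applied along one axis of a tile can be rewritten as the product of a banded coefficient matrix with that tile, which is the shape of work a matrix unit executes. The sketch below is a generic formulation, not necessarily the one MMStencil uses; the helper names (`stencil_direct`, `banded_matrix`, `matmul`), the tile sizes, and the coefficients are all assumptions, and the plain-Python `matmul` merely stands in for a hardware matrix instruction.

```python
# Illustrative only: a 1D high-order stencil written as a banded-matrix product.
# Function names, tile sizes, and coefficients are assumptions for this sketch.

def stencil_direct(tile, coeffs):
    """Apply a 1D stencil of radius r down each column of `tile`.

    `tile` has n + 2*r rows: n output rows plus a halo of r rows on each side.
    """
    r = len(coeffs) // 2
    n = len(tile) - 2 * r
    cols = len(tile[0])
    return [
        [sum(coeffs[k] * tile[i + k][j] for k in range(2 * r + 1))
         for j in range(cols)]
        for i in range(n)
    ]


def banded_matrix(coeffs, n):
    """Build the n x (n + 2*r) matrix whose row i holds coeffs in columns i..i+2r."""
    r = len(coeffs) // 2
    band = [[0.0] * (n + 2 * r) for _ in range(n)]
    for i in range(n):
        for k, c in enumerate(coeffs):
            band[i][i + k] = c
    return band


def matmul(a, b):
    """Triple-loop matrix product, standing in for a matrix-unit instruction."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Radius-2 (fourth-order) second-derivative coefficients, used as sample data.
coeffs = [-1 / 12, 4 / 3, -5 / 2, 4 / 3, -1 / 12]
n, cols = 4, 3
tile = [[float(i * i + j) for j in range(cols)] for i in range(n + 4)]

direct = stencil_direct(tile, coeffs)
via_matrix = matmul(banded_matrix(coeffs, n), tile)
```

Both paths produce the same tile of results. The banded matrix is mostly zeros, so a naive mapping spends part of the matrix unit's throughput on padding; choosing a formulation that keeps utilization high is part of what a matrix-based stencil design has to address.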
Yinuo Wang
Tsinghua University
LLM, Reinforcement Learning, Autonomous Driving, Diffusion Model
Tianqi Mao
Associate Professor, Beijing Institute of Technology
Integrated Sensing and Communication, Waveform, Modulation, Rydberg Atomic Receiver, THz Commun.
Lin Gan
Tsinghua University
Wubing Wan
Tsinghua University, Beijing, China
Zeyu Song
Tsinghua University, Beijing, China
Jiayu Fu
Tsinghua University, Beijing, China
Lanke He
Tsinghua University, Beijing, China
Wenqiang Wang
High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, China
Zekun Yin
School of Software, Shandong University, Jinan 250100, China
Wei Xue
Tsinghua University, Beijing, China
Guangwen Yang
Professor of Computer Science and Technology, Tsinghua University