Mass Matrix Assembly on Tensor Cores for Implicit Particle-In-Cell Methods

📅 2026-04-21

🤖 AI Summary
This work addresses a performance bottleneck in implicit particle-in-cell methods, where mass matrix assembly is dominated by reduction operations that make poor use of modern GPU tensor cores. We present the first formulation that exactly recasts this assembly as tensor-core-friendly matrix-matrix multiplication (GEMM). By combining high-order B-spline interpolation, particle batching, and support-domain grouping, the approach supports arbitrary interpolation orders and both scalar and tensor-valued block mass matrices. Using NVIDIA's tensor-core matrix multiply-accumulate (MMA) units, the method achieves up to 3× kernel-level speedup over an optimized baseline while maintaining full generality. In end-to-end ECSIM simulations, this translates to a 15% reduction in overall runtime.

πŸ“ Abstract
Matrix-multiply-accumulate (MMA) units, or tensor cores, are now widespread across modern computing architectures. Yet, their use for particle-grid operators remains limited. In implicit particle methods, mass-matrix assembly is a reduction-dominated kernel in which weighted outer products of interpolation weights are accumulated over particle support. We show that this operation can be reformulated exactly, cell by cell, as a sequence of matrix products matched to hardware MMA tiles. The formulation is general with respect to interpolation order and hardware platform, and applies to both scalar mass matrices and the tensorial block mass matrix arising in implicit in the Energy-Conserving Semi-Implicit Method (ECSIM) for Particle-in-Cell simulations. We introduce particle batching and a support-group decomposition for higher-order shape functions whose stencil extends beyond a single cell, specialize the method to first- and second-order B-spline interpolation, and implement it on NVIDIA tensor cores. The resulting kernels achieve up to 3x over optimized conventional implementations and reduce end-to-end ECSIM runtime by 15%.
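The core identity behind the reformulation can be illustrated with a small sketch. It is not the paper's kernel: it assumes a single 1D cell, first-order (linear) B-spline weights, a scalar per-particle weight q_p, and an arbitrary batch of 16 particles. All function names are hypothetical. It shows that accumulating the weighted outer products q_p N_p N_p^T particle by particle gives the same cell-local matrix as one matrix product A^T B, where row p of A holds N_p and row p of B holds q_p N_p.

```python
import random

def linear_weights(x):
    """First-order B-spline weights for the two nodes of a 1D cell,
    given the particle's local coordinate x in [0, 1)."""
    return [1.0 - x, x]

def assemble_by_reduction(xs, qs):
    """Baseline: accumulate q_p * N_p N_p^T one particle at a time."""
    M = [[0.0, 0.0], [0.0, 0.0]]
    for x, q in zip(xs, qs):
        w = linear_weights(x)
        for i in range(2):
            for j in range(2):
                M[i][j] += q * w[i] * w[j]
    return M

def assemble_by_matmul(xs, qs):
    """Same matrix as a product A^T B over the particle batch,
    with A[p] = N_p and B[p] = q_p * N_p."""
    A = [linear_weights(x) for x in xs]
    B = [[q * w for w in row] for row, q in zip(A, qs)]
    P = len(xs)
    return [[sum(A[p][i] * B[p][j] for p in range(P)) for j in range(2)]
            for i in range(2)]

# Illustrative batch: 16 particles with random positions and weights.
random.seed(0)
xs = [random.random() for _ in range(16)]
qs = [random.uniform(0.5, 1.5) for _ in range(16)]

M_ref = assemble_by_reduction(xs, qs)
M_mm = assemble_by_matmul(xs, qs)
max_diff = max(abs(M_ref[i][j] - M_mm[i][j])
               for i in range(2) for j in range(2))
```

In this toy setting the particle index plays the role of the inner (contraction) dimension of the product, which is the dimension an MMA tile reduces over. How batches are padded and mapped to specific tile shapes, and how stencils spanning several cells are grouped, is specific to the paper and is not modeled here.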
Problem

Research questions and friction points this paper is trying to address.

mass matrix assembly
tensor cores
implicit Particle-in-Cell
matrix-multiply-accumulate
particle-grid operators
Innovation

Methods, ideas, or system contributions that make the work stand out.

tensor cores
mass matrix assembly
Particle-in-Cell
matrix-multiply-accumulate
ECSIM