🤖 AI Summary
In modern architectures, memory bandwidth bottlenecks are increasingly severe, and the M2L (multipole-to-local) operation has become a critical performance limiter for kernel-independent fast multipole methods (kiFMM).
Method: This paper proposes a high-compute-intensity, BLAS-based M2L optimization: reformulating the M2L operator as a Level-3 BLAS computation, augmented with matrix compression, hand-tuned assembly kernels, and cross-architecture (ARM/x86) code generation.
Contribution/Results: To our knowledge, this is the first approach to achieve measured performance on par with highly optimized FFT-based M2L while preserving kernel independence and significantly reducing memory bandwidth pressure. Key innovations include: (1) unifying M2L as a high-operational-intensity BLAS-3 problem; (2) enabling flexible, fair comparison between FFT- and BLAS-based paradigms; and (3) delivering both throughput gains and architectural portability on multicore CPUs.
📝 Abstract
Because of the growing disparity between FLOP availability and memory bandwidth on modern architectures, algorithm design must focus on minimising data movement, even at the cost of more FLOPs. We review the requirements for the Multipole to Local (M2L) operation, a subroutine of the Kernel Independent Fast Multipole Method (kiFMM). The kiFMM is a variant of the popular Fast Multipole Method (FMM), which accelerates the evaluation of N-body potential problems. Naively implemented, the M2L can lead to bandwidth pressure, and it is therefore a key bottleneck in FMMs. Recent software packages for the kiFMM have relied on the Fast Fourier Transform (FFT) to accelerate the M2L, as it can be formulated as a convolution-type operation. However, 'black box' FMMs, developed in parallel, formulate the M2L as a BLAS operation and use direct matrix compression techniques for further acceleration. The FFT approach requires careful implementation to overcome the low operational intensity of the element-wise product inherent in its formulation, whereas the BLAS approach provides a high operational intensity formulation if the M2L is written in terms of level 3 BLAS operations. We describe algorithmic simplifications for the BLAS-based M2L operation, and show that the BLAS version of the M2L can be competitive in practice with the FFT version. We have developed a carefully optimised software implementation that allows us to switch flexibly between M2L approaches and is optimised for ARM and x86 targets, allowing for a fair comparison between the two.
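
The contrast in operational intensity between the element-wise product and a level 3 BLAS operation can be made concrete with a rough count of FLOPs per byte moved. The sketch below is illustrative only: the function names, the problem sizes, and the assumption that each operand crosses the memory bus exactly once (ideal cache reuse) are ours and are not taken from the implementation described here. It also assumes, as one plausible route to a level 3 formulation, that many expansion vectors are stacked as columns so that a single operator is applied as a matrix-matrix product.

```python
# Rough operational-intensity model (FLOPs per byte of memory traffic).
# Assumption: every operand is read or written exactly once, in double precision.

def hadamard_intensity(n, bytes_per_real=8):
    """Element-wise product of two complex vectors of length n."""
    flops = 6 * n                             # 4 real multiplies + 2 real adds per entry
    bytes_moved = 3 * n * 2 * bytes_per_real  # 2 inputs + 1 output, 2 reals per complex
    return flops / bytes_moved

def gemm_intensity(m, k, n, bytes_per_real=8):
    """Real matrix product C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    flops = 2 * m * k * n                                  # one multiply-add per (i, j, l)
    bytes_moved = (m * k + k * n + m * n) * bytes_per_real # read A and B, write C
    return flops / bytes_moved

# Hypothetical sizes, chosen only for illustration.
print("element-wise product:", hadamard_intensity(1000))
print("matrix-vector (n=1): ", gemm_intensity(256, 256, 1))
print("matrix-matrix (n=256):", gemm_intensity(256, 256, 256))
```

Under these assumptions the element-wise product stays at 0.125 FLOPs per byte regardless of vector length, and a single matrix-vector product is limited to roughly 0.25, while a 256 x 256 x 256 matrix-matrix product reaches about 21. The intensity of the matrix-matrix product grows with the matrix dimensions, which is why a level 3 formulation can relieve bandwidth pressure where the other two cannot.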