🤖 AI Summary
To meet the demand for large-scale PDE simulations in seismic and medical imaging, this paper proposes a fully automated, high-performance code-generation method for explicit finite-difference (FD) stencils. The approach integrates MPI-X (including UCX and shared memory) distributed-parallel code generation into the compilation pipeline of the Devito domain-specific language (DSL). This automates the path from symbolic modeling to HPC-ready code without source-code modifications and supports scalable execution on both CPUs and GPUs. Key techniques include symbolic differentiation, loop optimization, communication–computation overlap, and GPU offloading. Experiments on multi-node CPU/GPU clusters demonstrate excellent strong and weak scaling, a substantial reduction in execution time, and a decrease of over 70% in developer effort for coding and performance tuning. The framework has been deployed in production-scale scientific computing tasks, including real-world seismic full-waveform inversion.
📝 Abstract
Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed-memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers from high-level symbolic math input. Users can model simulations for real-world applications at a high-level symbolic abstraction and harness HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions in both execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, showing that it can meet the demands of large-scale scientific simulations.
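To make the core idea concrete, the following is a minimal sketch of what distributed-memory execution of an explicit FD stencil involves: the grid is split into contiguous blocks, each block receives a one-point "halo" from its neighbours, and then every block applies the same update independently. This is an illustration only, not Devito's generated code. It assumes a 1D diffusion-type stencil, simulates the ranks inside a single Python process instead of using MPI, and all function names (`step_serial`, `split`, `step_decomposed`, `run_demo`) are hypothetical.

```python
def step_serial(u, alpha):
    """One explicit FD update on the whole grid; the two end points stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def split(u, nranks):
    """Block-decompose u into nranks contiguous chunks (assumes len(u) >= nranks)."""
    base, extra = divmod(len(u), nranks)
    chunks, start = [], 0
    for r in range(nranks):
        size = base + (1 if r < extra else 0)
        chunks.append(u[start:start + size])
        start += size
    return chunks


def step_decomposed(chunks, alpha):
    """One update per simulated rank, preceded by a one-point halo exchange."""
    nranks = len(chunks)
    # Halo exchange: each rank obtains the edge value of each neighbour.
    # In a real MPI program these list lookups would be send/receive calls.
    left_halo = [chunks[r - 1][-1] if r > 0 else None for r in range(nranks)]
    right_halo = [chunks[r + 1][0] if r < nranks - 1 else None
                  for r in range(nranks)]

    new_chunks = []
    for r, local in enumerate(chunks):
        new = local[:]
        for i in range(len(local)):
            left = local[i - 1] if i > 0 else left_halo[r]
            right = local[i + 1] if i < len(local) - 1 else right_halo[r]
            if left is None or right is None:
                continue  # physical boundary point: keep its value fixed
            new[i] = local[i] + alpha * (left - 2.0 * local[i] + right)
        new_chunks.append(new)
    return new_chunks


def run_demo(n=16, nranks=4, steps=10, alpha=0.25):
    """Run serial and decomposed versions side by side and return both results."""
    u0 = [0.0] * n
    u0[n // 2] = 1.0  # initial spike in the middle of the domain
    serial = u0[:]
    chunks = split(u0, nranks)
    for _ in range(steps):
        serial = step_serial(serial, alpha)
        chunks = step_decomposed(chunks, alpha)
    gathered = [value for chunk in chunks for value in chunk]
    return serial, gathered


if __name__ == "__main__":
    serial, gathered = run_demo()
    print("results match:", serial == gathered)
```

The decomposed run reproduces the serial result because each block sees exactly the neighbour values the serial loop would have read. Writing, tuning, and overlapping this exchange by hand for higher-order stencils in 3D is the kind of effort the abstract describes automating.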