🤖 AI Summary
This work addresses the reliance on global synchronization and collective communication in distributed anisotropic mesh adaptation by proposing a decoupled approach that separates meshing functionality from performance concerns. The method first adapts the subdomain boundaries (interface elements) of a decomposed initial mesh on a single shared-memory multicore node, then distributes the subdomains across the nodes of an HPC cluster, where their interiors are adapted in parallel while the already adapted boundaries stay frozen to keep the mesh conforming. The approach combines a cc-NUMA-aware shared-memory mesh generator with a speculative execution model, a parallel runtime system, and the boundary-freezing strategy, and is designed to avoid global synchronization. Experimental results show that it generates meshes of up to approximately one billion elements with quality and performance comparable to state-of-the-art HPC meshing software.
📝 Abstract
This paper presents a distributed-memory method for anisotropic mesh adaptation that is designed to avoid collective communication and global synchronization. The method separates meshing functionality from performance aspects by using a separate entity for each: a multicore cc-NUMA-based (shared-memory) mesh generation software, and a parallel runtime system designed to help applications leverage the concurrency offered by emerging high-performance computing (HPC) architectures. First, an initial mesh is decomposed and its interface elements (subdomain boundaries) are adapted on a single multicore (shared-memory) node. The subdomains are then distributed among the nodes of an HPC cluster, where their interior elements are adapted while the already adapted interface elements remain frozen to maintain mesh conformity. The paper also presents lessons learned from redesigning parts of the shared-memory software and describes how its speculative execution model is used by the distributed-memory method to achieve good performance. The method is shown to generate meshes of up to approximately 1 billion elements with quality and performance comparable to existing state-of-the-art HPC meshing software.
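The two-phase workflow described in the abstract can be illustrated with a toy one-dimensional analogue. This is a minimal sketch under strong simplifying assumptions, not the paper's implementation: the "mesh" is a list of points on an interval, refinement is plain midpoint insertion against a target edge length `h`, and a thread pool stands in for the cluster nodes. The function names `decompose`, `adapt_interior`, and `adapt` are hypothetical. In one dimension the interfaces are single points, so the interface-adaptation phase reduces to choosing the cut points; what the sketch does show is that once interface points are fixed, each subdomain can be refined independently, with no communication, and the pieces still join into a conforming mesh.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(lo, hi, parts):
    """Split [lo, hi] into `parts` subdomains, each given by its two interface points."""
    cuts = [lo + (hi - lo) * i / parts for i in range(parts + 1)]
    return list(zip(cuts[:-1], cuts[1:]))


def adapt_interior(subdomain, h):
    """Refine one subdomain until every edge is <= h.

    Only interior points are inserted; the two end points (the interface)
    are frozen and are never moved or removed.
    """
    left, right = subdomain
    points = [left, right]
    while True:
        refined = [points[0]]
        changed = False
        for a, b in zip(points, points[1:]):
            if b - a > h:
                refined.append((a + b) / 2)  # split an edge that is too long
                changed = True
            refined.append(b)
        points = refined
        if not changed:
            return points


def adapt(lo, hi, parts, h):
    """Toy two-phase adaptation: fix interfaces first, then refine interiors independently."""
    # Phase 1 (stand-in): the interface points are decided once, up front.
    subdomains = decompose(lo, hi, parts)

    # Phase 2: each subdomain is refined on its own worker; workers never
    # exchange data because the interface points cannot change.
    with ThreadPoolExecutor() as pool:
        pieces = list(pool.map(adapt_interior, subdomains, [h] * len(subdomains)))

    # Merging needs no repair step: neighbours share identical interface points.
    merged = list(pieces[0])
    for left_piece, piece in zip(pieces, pieces[1:]):
        assert left_piece[-1] == piece[0], "interface mismatch between subdomains"
        merged.extend(piece[1:])
    return merged
```

For example, `adapt(0.0, 1.0, 4, 0.1)` returns 17 evenly spaced points with spacing 0.0625, and the cut points 0.25, 0.5, and 0.75 appear unchanged in the result. The real method works on anisotropic meshes, where the interface elements themselves must be adapted before they are frozen.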