🤖 AI Summary
To address load imbalance and poor convergence robustness in high-dimensional adaptive numerical integration on multi-GPU systems, this paper proposes a decentralized distributed algorithm. The method employs hierarchical domain decomposition and local error-driven recursive subdivision, enabling independent adaptive partitioning on each GPU. A cyclic polling-based dynamic load redistribution mechanism is designed, leveraging non-blocking CUDA-aware MPI for low-overhead inter-GPU communication—without requiring global synchronization or centralized scheduling. Experiments on typical 10–50 dimensional integral problems demonstrate that the proposed approach achieves 1.8–3.2× higher computational efficiency compared to state-of-the-art GPU-accelerated integration libraries (e.g., Cuba-GPU, GpuQUAD). Moreover, it exhibits significantly enhanced robustness against degradation in integrand regularity and variations in target accuracy.
📝 Abstract
We introduce a distributed adaptive quadrature method that formulates multidimensional integration as a hierarchical domain decomposition problem on multi-GPU architectures. The integration domain is recursively partitioned into subdomains whose refinement is guided by local error estimators. Each subdomain evolves independently on a GPU, which exposes a significant load imbalance as the adaptive process progresses. To address this challenge, we introduce a decentralised load redistribution schemes based on a cyclic round-robin policy. This strategy dynamically rebalance subdomains across devices through non-blocking, CUDA-aware MPI communication that overlaps with computation. The proposed strategy has two main advantages compared to a state-of-the-art GPU-tailored package: higher efficiency in high dimensions; and improved robustness w.r.t the integrand regularity and the target accuracy.