🤖 AI Summary
This work addresses the limited scalability of traditional distributed FFT algorithms on heterogeneous high-performance computing platforms, which stems from static task allocation and global synchronization barriers. To overcome these limitations, the authors propose a task-graph-based dynamic scheduling approach that breaks each FFT stage into fine-grained tasks, with each stage operating on its own distributed array. Built on the Julia programming language, the Dagger runtime, and the DTask task model, the method enables cross-device work stealing and barrier-free execution. Data distribution is handled through pencil- or slab-partitioned DArrays. The resulting framework, DaggerFFT, is the first task-graph-driven distributed FFT implementation in Julia, achieving speedups of up to 2.6× on CPU clusters and 1.35× on GPU clusters. It has been integrated into the Oceananigans.jl fluid simulation framework, improving resource utilization and scalability on heterogeneous systems.
📝 Abstract
The Fast Fourier Transform (FFT) is a fundamental numerical technique with widespread application across scientific problems. As scientific simulations attempt to exploit exascale systems, there has been a growing demand for distributed FFT algorithms that can effectively utilize modern heterogeneous high-performance computing (HPC) systems. Conventional FFT algorithms commonly encounter performance bottlenecks, especially when run on heterogeneous platforms. Most distributed FFT approaches rely on static task distribution and require synchronization barriers, which limits scalability and reduces overall resource utilization. In this paper, we present DaggerFFT, a distributed FFT framework developed in Julia that treats highly parallel FFT computations as a dynamically scheduled task graph. FFT operations are expressed as DTasks operating on pencil- or slab-partitioned DArrays. Each FFT stage owns its own DArray, and the runtime assigns DTasks across devices using Dagger's dynamic, work-stealing scheduler. We demonstrate that DaggerFFT's dynamic scheduler can outperform state-of-the-art distributed FFT libraries on both CPU and GPU backends, achieving up to a 2.6× speedup on CPU clusters and up to a 1.35× speedup on GPU clusters. We have integrated DaggerFFT into Oceananigans.jl, a geophysical fluid dynamics framework, demonstrating that high-level, task-based runtimes can deliver both superior performance and modularity in large-scale, real-world simulations.
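To make the slab partitioning mentioned in the abstract concrete, the following is a minimal sketch in Python using only the standard library. It is not DaggerFFT code and does not use Dagger, DTasks, or DArrays. All function names (`dft`, `transform_slab`, `split`, `slab_fft2`, `direct_dft2`) are hypothetical, a naive O(n²) DFT stands in for a real FFT kernel, and a thread pool stands in for the task runtime. It shows only that a 2D transform can be split into per-slab tasks along one axis, redistributed, and then split into per-slab tasks along the other axis. Unlike the barrier-free execution described above, this sketch waits for each stage to complete before starting the next.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(vec):
    """Naive O(n^2) 1D DFT; a stand-in for a real FFT kernel."""
    n = len(vec)
    return [
        sum(vec[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transform_slab(slab):
    """One task: transform every row held in a single slab."""
    return [dft(row) for row in slab]


def split(rows, parts):
    """Split a list of rows into at most `parts` contiguous slabs."""
    size = (len(rows) + parts - 1) // parts
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def transpose(matrix):
    """Swap the two axes (stands in for the inter-stage redistribution)."""
    return [list(col) for col in zip(*matrix)]


def slab_fft2(matrix, workers=2):
    """2D DFT via slab decomposition: row stage, redistribute, column stage."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Stage 1: each slab of rows is an independent task.
        stage1 = [
            row
            for slab in pool.map(transform_slab, split(matrix, workers))
            for row in slab
        ]
        # Redistribution: make the second axis contiguous within each slab.
        redistributed = transpose(stage1)
        # Stage 2: each slab of former columns is an independent task.
        stage2 = [
            row
            for slab in pool.map(transform_slab, split(redistributed, workers))
            for row in slab
        ]
    # Transpose back so the result is indexed as [row frequency][column frequency].
    return transpose(stage2)


def direct_dft2(matrix):
    """Reference: 2D DFT evaluated directly from its definition."""
    n, m = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[a][b] * cmath.exp(-2j * cmath.pi * (a * k / n + b * l / m))
                for a in range(n)
                for b in range(m)
            )
            for l in range(m)
        ]
        for k in range(n)
    ]


def max_abs_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


# Small 4x4 test input with entries 0..15.
data = [[complex(r * 4 + c, 0) for c in range(4)] for r in range(4)]
```

Calling `slab_fft2(data, workers=2)` and comparing it against `direct_dft2(data)` with `max_abs_diff` should give a difference at floating-point round-off level. A pencil decomposition follows the same pattern in three dimensions, with two redistribution steps instead of one.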