TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU

📅 2025-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
Fourier Neural Operators (FNOs) used for PDE solving run their FFT, GEMM, and iFFT stages as separate, unfused GPU kernels, which causes high memory bandwidth pressure and repeated kernel launch overhead. Method: This work introduces the first end-to-end fused GPU kernel for FNOs, with an architecture-aware, fully integrated FFT-GEMM-iFFT computational flow. The FFT kernel has high-frequency truncation, zero-padding, and pruning built in, and swizzling with thread-block remapping yields 100% shared memory bank utilization. Custom high-performance FFT and GEMM micro-kernels are designed with optimized global memory access patterns. Results: On an NVIDIA A100, TurboFNO outperforms PyTorch with cuBLAS and cuFFT by up to 150%, while reducing global memory traffic and kernel launch overhead.

📝 Abstract
Fourier Neural Operators (FNO) are widely used for learning partial differential equation solution operators. However, FNO lacks architecture-aware optimizations, with its Fourier layers executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, incurring multiple kernel launches and significant global memory traffic. We propose TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. We first develop FFT and GEMM kernels from scratch, achieving performance comparable to or faster than the closed-source SOTA cuBLAS and cuFFT. Additionally, our FFT kernel integrates built-in high-frequency truncation, input zero-padding, and pruning features to avoid additional memory copy kernels. To fuse the FFT and GEMM workloads, we propose an FFT variant in which a single thread block iterates over the hidden dimension, aligning with the $k$-loop in GEMM. Additionally, we design two shared memory swizzling patterns to achieve 100% memory bank utilization when forwarding FFT output to GEMM and to enable the iFFT to retrieve GEMM results directly from shared memory. Experimental results on an NVIDIA A100 GPU show that TurboFNO outperforms PyTorch, cuBLAS, and cuFFT by up to 150%.
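
To make the five separate stages named in the abstract concrete, here is a minimal CPU sketch of an unfused 1D Fourier layer in plain Python. It is an illustration only, not the paper's implementation: the function names `dft` and `fourier_layer_staged` are hypothetical, a naive O(n²) DFT stands in for cuFFT, and one weight matrix is assumed to be shared by all kept modes so that the channel mixing is a plain GEMM (the original FNO formulation uses per-mode weights). Each stage produces a full intermediate list, which mirrors the extra kernel launch and global memory round trip that a staged GPU pipeline pays.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT of a list of numbers (a stand-in for a real FFT library)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out

def fourier_layer_staged(x, w, modes):
    """x: [c_in][n] signal, w: [c_in][c_out] weights, modes: number of kept frequencies."""
    n, c_in, c_out = len(x[0]), len(x), len(w[0])
    # Stage 1: FFT along the spatial axis, one transform per input channel.
    spec = [dft(ch) for ch in x]
    # Stage 2: filtering, keep only the first `modes` frequencies.
    trunc = [ch[:modes] for ch in spec]
    # Stage 3: GEMM over channels: mixed[o][f] = sum_i trunc[i][f] * w[i][o].
    mixed = [[sum(trunc[i][f] * w[i][o] for i in range(c_in)) for f in range(modes)]
             for o in range(c_out)]
    # Stage 4: zero-pad the spectrum back to the full length n.
    padded = [ch + [0j] * (n - modes) for ch in mixed]
    # Stage 5: inverse FFT back to the spatial domain.
    return [dft(ch, inverse=True) for ch in padded]

# Keeping only the zero-frequency mode leaves the channel mean (2.5) at every point.
y = fourier_layer_staged([[1.0, 2.0, 3.0, 4.0]], [[1.0]], modes=1)
```

The four intermediates `spec`, `trunc`, `mixed`, and `padded` are the buffers a fused kernel aims to keep out of global memory.
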
Problem

Research questions and friction points this paper is trying to address.

Optimize Fourier Neural Operator for GPU performance
Fuse FFT-GEMM-iFFT operations to reduce kernel launches
Minimize global memory traffic in FNO computations
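
The fusion goal can be illustrated with a small CPU sketch of the idea the abstract describes: iterate over the hidden (channel) dimension in tiles, transform only the channels of the current tile, and feed the result straight into the GEMM accumulation. This is an assumption-laden sketch, not TurboFNO's kernel: `truncated_dft`, `fused_fft_gemm`, `staged_fft_gemm`, and the `k_tile` parameter are hypothetical names, a naive DFT replaces the FFT, and Python lists stand in for shared memory and registers. It only shows that the fused loop order gives the same result as the staged one.

```python
import cmath

def truncated_dft(x, modes):
    """Compute only the first `modes` DFT outputs; discarded frequencies are never formed."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(modes)]

def fused_fft_gemm(x, w, modes, k_tile=2):
    """x: [c_in][n], w: [c_in][c_out]. Returns acc[f][o] = sum_i DFT(x[i])[f] * w[i][o]."""
    c_in, c_out = len(x), len(w[0])
    acc = [[0j] * c_out for _ in range(modes)]   # GEMM accumulator
    for k0 in range(0, c_in, k_tile):            # GEMM k-loop over hidden channels
        # Transform only this tile's channels; the tile plays the role of the
        # shared-memory operand and is never written to a full-size buffer.
        a_tile = [truncated_dft(x[i], modes) for i in range(k0, min(k0 + k_tile, c_in))]
        for j, row in enumerate(a_tile):
            for f in range(modes):
                for o in range(c_out):
                    acc[f][o] += row[f] * w[k0 + j][o]
    return acc

def staged_fft_gemm(x, w, modes):
    """Reference: full transform of every channel, then truncate, then GEMM."""
    spec = [truncated_dft(ch, len(ch))[:modes] for ch in x]
    return [[sum(spec[i][f] * w[i][o] for i in range(len(x))) for o in range(len(w[0]))]
            for f in range(modes)]

x = [[float((i + 1) * (t % 3) - t) for t in range(8)] for i in range(3)]
w = [[1 + 1j, 2.0], [0.5j, -1.0], [3.0, 1 - 2j]]
fused = fused_fft_gemm(x, w, modes=3, k_tile=2)
staged = staged_fft_gemm(x, w, modes=3)
```

Here `fused` and `staged` agree up to floating-point rounding, while the fused version holds at most `k_tile` transformed channels at any time.
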
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully fused FFT-GEMM-iFFT GPU kernel
FFT kernel with built-in truncation, zero-padding, and pruning
Shared memory swizzling for 100% bank utilization
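
To show what swizzling buys in general, the sketch below models shared-memory bank conflicts for a 32x32 tile of 4-byte words. It uses a generic XOR swizzle, not either of the two patterns designed in the paper, which this summary does not specify. The model assumes 32 banks of 4-byte words, and the helper names `bank` and `worst_conflict` are hypothetical.

```python
from collections import Counter

BANKS = 32  # assumed: 32 shared-memory banks, each one 4-byte word wide

def bank(row, col, swizzle):
    """Bank hit by logical element (row, col) of a row-major 32x32 tile."""
    col_phys = col ^ row if swizzle else col   # XOR swizzle permutes columns per row
    return (row * BANKS + col_phys) % BANKS

def worst_conflict(accesses, swizzle):
    """Largest number of threads in a warp that hit the same bank (1 = conflict-free)."""
    return max(Counter(bank(r, c, swizzle) for r, c in accesses).values())

column_read = [(t, 5) for t in range(32)]  # thread t reads element (t, 5)
row_read = [(5, t) for t in range(32)]     # thread t reads element (5, t)

print(worst_conflict(column_read, swizzle=False))  # 32: every thread hits bank 5
print(worst_conflict(column_read, swizzle=True))   # 1: each thread hits its own bank
print(worst_conflict(row_read, swizzle=True))      # 1: row access stays conflict-free
```

In this toy model the swizzled layout serves both row-wise and column-wise access without conflicts, which is the property needed when one stage writes a tile in one order and the next stage reads it in another.
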