Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Frequent fine-grained kernel launches on GPUs incur substantial launch overhead, severely limiting the performance of iterative scientific applications. To address this, we propose combining iteration batching with CUDA Graph unrolling: multiple iterations are grouped into a batch and statically unrolled into a single CUDA Graph, so that one graph launch replaces many kernel launches. We further derive a workload-independent criterion for selecting the optimal batch size on a given platform, backed by an analytical performance model. On a skeleton application, the approach achieves more than 1.4x speedup, and it delivers similar gains on real-world iterative GPU applications (Hotspot, Hotspot3D, and an FDTD-based Maxwell solver) without application-specific tuning. This work establishes a general, analytically tractable, low-overhead execution paradigm for iterative GPU computations.
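The batch-size trade-off the summary describes can be illustrated with a toy cost model. The linear creation cost, the constants, and the power-of-two sweep below are illustrative assumptions, not the paper's actual model: a graph of B unrolled iterations costs roughly B node-creation times to build, while the per-replay launch overhead is amortized over B iterations; the kernel term is constant in B, which is why the optimum is workload-independent.

```cuda
// Host-only toy cost model (assumed form, not the paper's formulation).
#include <cstdio>

// One graph of B unrolled iterations is built at a cost of ~B * tNodeCreate
// and replayed ceil(nIters / B) times at tGraphLaunch per replay.
// The nIters * tKernel term does not depend on B.
double totalTime(int B, int nIters, double tKernel,
                 double tGraphLaunch, double tNodeCreate) {
    int nLaunches = (nIters + B - 1) / B;
    return B * tNodeCreate + nLaunches * tGraphLaunch + nIters * tKernel;
}

int main() {
    int best = 1;
    double bestT = 1e30;
    // Sweep power-of-two batch sizes for 4096 iterations with made-up
    // (hypothetical) timing constants in seconds.
    for (int B = 1; B <= 4096; B *= 2) {
        double t = totalTime(B, 4096, 50e-6, 10e-6, 4e-6);
        printf("B=%4d  total=%g s\n", B, t);
        if (t < bestT) { bestT = t; best = B; }
    }
    printf("optimal batch size under this toy model: %d\n", best);
    return 0;
}
```

Note that the optimum balances the two overhead terms only; changing tKernel (the per-iteration workload) shifts every candidate by the same constant and leaves the optimal B unchanged, mirroring the paper's claim that the optimal batch size is workload-independent for a given platform.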

📝 Abstract
Graphics Processing Units (GPUs) have become the standard for accelerating scientific applications on heterogeneous systems. However, as GPUs get faster, one potential performance bottleneck in GPU-accelerated applications is the overhead of launching several fine-grained kernels. CUDA Graph addresses this challenge with a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph for a performance boost. We analyze the performance gain and overhead of this approach by designing a skeleton application. The skeleton application also serves as a generalized example of converting an iterative solver to CUDA Graph and as the basis for deriving a performance model. Using the skeleton application, we show that when unrolling iteration batches for a given platform, there is an optimal iteration batch size, independent of workload, that balances the extra overhead of graph creation against the performance gain of graph execution. Depending on workload, we show that the optimal iteration batch size gives more than 1.4x speed-up in the skeleton application. Furthermore, we show that a similar speed-up can be achieved in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver.
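The batching-plus-unrolling strategy can be sketched with CUDA stream capture. The kernel (stepKernel), batch size (N_BATCH), and ping-pong buffering below are illustrative assumptions, not the paper's implementation; error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

// Hypothetical iterative stencil step standing in for the solver kernel.
__global__ void stepKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * (in[i] + in[(i + 1) % n]);
}

int main() {
    const int n = 1 << 20, N_BATCH = 64, N_ITERS = 4096;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Statically unroll N_BATCH iterations into one graph via stream capture.
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int it = 0; it < N_BATCH; ++it) {
        stepKernel<<<(n + 255) / 256, 256, 0, s>>>(b, a, n);
        float *t = a; a = b; b = t;  // ping-pong buffers between iterations
    }
    cudaStreamEndCapture(s, &graph);

    // CUDA 12 signature; older toolkits take two extra error-reporting args.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);

    // One graph launch now replaces N_BATCH kernel launches.
    for (int batch = 0; batch < N_ITERS / N_BATCH; ++batch)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

The graph is built once and replayed N_ITERS / N_BATCH times, so the (batch-size-dependent) creation cost is paid only once while each replay avoids N_BATCH individual launch overheads.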
Problem

Research questions and friction points this paper is trying to address.

GPU
CUDA
performance bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

CUDA Graphs
Optimization
Performance Enhancement