Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently parallelizing finite element variational forms—expressed in the Unified Form Language (UFL)—on GPUs. It proposes a UFL-specific model of the GPU scheduling space together with a cost-driven automated search framework. The method combines UFL symbolic parsing, a low-overhead heuristic cost model, state-aware pruning, and automatic CUDA code generation, and is prototyped within the Firedrake framework with architecture-specific optimizations for the Volta and Kepler GPU generations. The core contribution is the explicit modeling of latency hiding and of the trade-off between register usage and occupancy, addressing a longstanding scheduling bottleneck for FEM operators on heterogeneous platforms. In experiments on a Titan V and a Tesla K40c, 65% of test cases—spanning fluid dynamics, wave propagation, and structural mechanics operators—achieve over 50% of roofline performance, significantly outperforming existing baselines.

📝 Abstract
We present a novel parallelization strategy for evaluating Finite Element Method (FEM) variational forms on GPUs, focusing on those that are expressible through the Unified Form Language (UFL) on simplex meshes. We base our approach on code transformations, wherein we construct a space of scheduling candidates and rank them via a heuristic cost model to effectively handle the large diversity of computational workloads that can be expressed in this way. We present a design of a search space to which the cost model is applied, along with an associated pruning strategy to limit the number of configurations that need to be empirically evaluated. The goal of our design is to strike a balance between the device's latency-hiding capabilities and the amount of state space, a key factor in attaining near-roofline performance. To make our work widely available, we have prototyped our parallelization strategy within the Firedrake framework, a UFL-based FEM solver. We evaluate the performance of our parallelization scheme on two generations of Nvidia GPUs, specifically the Titan V (Volta architecture) and Tesla K40c (Kepler architecture), across a range of operators commonly used in applications, including fluid dynamics, wave propagation, and structural mechanics, in 2D and 3D geometries. Our results demonstrate that our proposed algorithm achieves more than 50% roofline performance in 65% of the test cases on both devices.
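The abstract's central trade-off—registers held per thread versus resident warps available for latency hiding—can be illustrated with a toy occupancy model. The per-SM limits below are the published Volta figures; the model itself is a deliberate simplification for illustration, not the paper's actual cost model:

```python
# Toy occupancy estimate illustrating the register/occupancy trade-off.
# Defaults are Volta per-SM limits (64K 32-bit registers, 2048 threads);
# this simplified model ignores shared memory and block-count caps.
def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=65536, max_threads_per_sm=2048):
    """Fraction of the SM's thread slots that can be resident at once."""
    if regs_per_thread * threads_per_block > regs_per_sm:
        return 0.0  # the block cannot launch at all
    blocks_by_regs = regs_per_sm // (regs_per_thread * threads_per_block)
    blocks_by_threads = max_threads_per_sm // threads_per_block
    resident = min(blocks_by_regs, blocks_by_threads) * threads_per_block
    return resident / max_threads_per_sm

# Holding more state per thread leaves fewer resident warps to hide latency:
print(occupancy(32, 256))   # 1.0
print(occupancy(128, 256))  # 0.25
```

Under this model, quadrupling per-thread register use cuts residency to a quarter, which is the kind of latency-hiding penalty the proposed cost model is designed to weigh against recomputation.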
Problem

Research questions and friction points this paper is trying to address.

Optimize GPU code generation for FEM variational forms
Balance latency-hiding and state space for performance
Achieve near-roofline performance on diverse computational workloads
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code transformations for FEM variational forms
Heuristic cost model for scheduling candidates
Integration with Firedrake framework for GPUs
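The candidate-ranking idea above can be sketched in plain Python. The `Candidate` fields and the cost function here are hypothetical stand-ins for the paper's UFL-specific search space and heuristic model, kept stdlib-only so the sketch is self-contained:

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    # Hypothetical scheduling parameters; the paper's real search space
    # covers UFL-specific loop transformations inside Firedrake.
    cells_per_block: int
    threads_per_cell: int
    est_regs: int  # estimated registers per thread

def cost(c, regs_per_sm=65536, max_threads=2048):
    """Heuristic cost (lower is better): penalize low occupancy,
    i.e. poor latency hiding due to excessive per-thread state."""
    threads = c.cells_per_block * c.threads_per_cell
    if threads == 0 or c.est_regs * threads > regs_per_sm:
        return float("inf")  # infeasible: block cannot be resident
    blocks = min(regs_per_sm // (c.est_regs * threads),
                 max_threads // threads)
    occ = blocks * threads / max_threads
    return 1.0 / occ

def top_k(candidates, k=3):
    """State-aware pruning + ranking: drop infeasible candidates, then
    keep only the k cheapest for empirical (on-device) evaluation."""
    ranked = [(cost(c), c) for c in candidates]
    ranked = [rc for rc in ranked if rc[0] != float("inf")]
    return [c for _, c in heapq.nsmallest(k, ranked, key=lambda rc: rc[0])]
```

The pruning step mirrors the paper's goal of limiting how many configurations must actually be compiled and timed: only candidates surviving the cheap heuristic filter reach empirical evaluation.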