Compiler-supported reduced precision and AoS-SoA transformations for heterogeneous hardware

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the bandwidth-bound data-layout and precision co-optimization problem for Lagrangian particle simulations on heterogeneous GPU platforms. Methodologically, it (i) systematically compares AoS versus SoA memory layouts under SIMT execution semantics; (ii) studies where sub-IEEE-precision conversion is best placed—either CPU-side pre-conversion or GPU-side, logically in-place conversion on demand; and (iii) proposes a compiler-annotation-based co-optimization framework that lets programmers explicitly control the timing and location of data-format transformations, integrated with heterogeneous memory sharing for efficient GPU offloading. On Nvidia's GH200, the approach achieves up to a 2.6× speedup over baseline implementations for selected compute kernels. AMD's MI300A exhibits more robust performance across configurations but benefits less from the transformations, suggesting that the technique is portable while its payoff is architecture-dependent.
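To make the layout question concrete, here is a minimal sketch of the AoS-to-SoA transformation the paper's compiler annotations automate. The struct and field names are illustrative assumptions, not taken from the paper; the point is that the SoA form stores one contiguous stream per attribute, so consecutive SIMT lanes read consecutive addresses (coalesced access), whereas AoS interleaves attributes per particle.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle record (field names are illustrative).
// AoS: attributes of one particle sit next to each other in memory.
struct ParticleAoS {
    double x, y, z;   // position
    double m;         // mass
};

// SoA counterpart: one contiguous array per attribute, so a warp of
// SIMT lanes touching particles i, i+1, ... issues coalesced loads.
struct ParticlesSoA {
    std::vector<double> x, y, z, m;
};

// The AoS -> SoA gather a compiler-emitted transformation would perform.
ParticlesSoA toSoA(const std::vector<ParticleAoS>& aos) {
    ParticlesSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    soa.m.reserve(aos.size());
    for (const auto& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
        soa.m.push_back(p.m);
    }
    return soa;
}
```

The paper's contribution is letting annotations decide whether this gather runs on the host before offloading or on the accelerator itself; the loop body is the same either way.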

📝 Abstract
This study evaluates AoS-to-SoA transformations over reduced-precision data layouts for a particle simulation code on several GPU platforms: We hypothesize that SoA fits particularly well to SIMT, while AoS is the preferred storage format for many Lagrangian codes. Reduced precision (below IEEE accuracy) is an established tool to address bandwidth constraints, although it remains unclear whether AoS and precision conversions should execute on a CPU or be deployed to a GPU if the compute kernel itself must run on an accelerator. On modern superchips where CPUs and GPUs share (logically) one data space, it is also unclear whether it is advantageous to stream data to the accelerator prior to the calculation, or whether we should let the accelerator transform data on demand, i.e. work in-place logically. We therefore introduce compiler annotations to facilitate such conversions and to give the programmer the option to orchestrate the conversions in combination with GPU offloading. For some of our compute kernels of interest, Nvidia's GH200 platforms yield a speedup of around 2.6, while AMD's MI300A exhibits more robust performance yet profits less. We assume that our compiler-based techniques are applicable to a wide variety of Lagrangian codes and beyond.
Problem

Research questions and friction points this paper is trying to address.

Optimizing data layout transformations for GPU performance
Determining efficient precision reduction and conversion strategies
Enabling compiler-supported orchestration of data transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler annotations enable AoS-to-SoA transformations
Reduced precision data layouts address bandwidth constraints
Orchestrate conversions with GPU offloading for performance
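The reduced-precision side of the co-optimization can be sketched as follows. This assumes a bfloat16-style truncation (keep sign, exponent, and 7 mantissa bits of an IEEE-754 float); the paper's actual storage formats and conversion routines may differ. The CPU-side pre-conversion variant shown here halves the bytes streamed to the accelerator, at the cost of roughly two decimal digits of mantissa.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Truncate an IEEE-754 binary32 value to a bfloat16-style 16-bit word:
// the top 16 bits carry sign, 8 exponent bits, and 7 mantissa bits.
uint16_t toBF16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return static_cast<uint16_t>(bits >> 16);  // drop low 16 mantissa bits
}

// Widen the 16-bit storage format back to float for computation.
float fromBF16(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// CPU-side pre-conversion of a whole attribute stream before offloading;
// the alternative placement converts on the GPU, logically in-place.
std::vector<uint16_t> compress(const std::vector<float>& v) {
    std::vector<uint16_t> out;
    out.reserve(v.size());
    for (float f : v) out.push_back(toBF16(f));
    return out;
}
```

Truncation introduces a relative error below 2^-7 ≈ 0.8%, which is the accuracy-for-bandwidth trade the annotations let the programmer opt into per kernel.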
Pawel K. Radtke
Department of Computer Science, Institute for Data Science, Durham University, United Kingdom
Tobias Weinzierl
Durham University
Scientific Computing · Parallel Algorithms · High Performance Computing