Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search

📅 2026-04-13
🤖 AI Summary
This work addresses the challenge of keeping GPU kernels fast across successive hardware generations. It proposes Record-Remix-Replay (R^3), a hierarchical auto-tuning framework that searches end to end across source-level implementation choices, compiler pass sequences, and runtime launch parameters. The framework combines LLM-guided evolutionary search, Bayesian optimization, and record-replay compilation so that optimization dimensions usually tuned in isolation are explored together. The authors report that R^3 optimizes full scientific applications better than approaches that tune only kernel parameters and compiler flags, and that its search is nearly an order of magnitude faster than modern evolutionary search approaches.

📝 Abstract
As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort. In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that Record-Remix-Replay can optimize full scientific applications better than traditional approaches over kernel parameters and compiler flags, while also being nearly an order of magnitude faster than modern evolutionary search approaches.
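
The layered search described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the variant names, pass names, block sizes, and the synthetic cost model are all hypothetical. Random search stands in for Bayesian optimization, a random pick stands in for an LLM-proposed source rewrite, and the dictionary cache is plain memoization rather than record-replay compilation. The sketch only shows the shape of a two-level search, where an outer loop changes the source variant and an inner loop tunes compiler pass order and launch parameters for that variant.

```python
import random

# Hypothetical search space: every name and the cost model below are
# illustrative assumptions, not taken from the paper.
SOURCE_VARIANTS = ["naive", "tiled", "tiled_unrolled"]
PASS_ORDERS = [("inline", "unroll"), ("unroll", "inline"), ("inline",)]
BLOCK_SIZES = [64, 128, 256, 512]


def synthetic_runtime(variant, passes, block_size):
    """Toy, deterministic stand-in for compiling and timing a kernel."""
    base = {"naive": 10.0, "tiled": 6.0, "tiled_unrolled": 5.0}[variant]
    pass_penalty = 0.5 if passes[0] == "unroll" else 0.0
    block_penalty = abs(block_size - 256) / 256.0
    return base + pass_penalty + block_penalty


def tune_inner(variant, rng, budget=8):
    """Inner level: random search over pass order and block size.

    Random search is a simple stand-in for Bayesian optimization.
    """
    best = None
    for _ in range(budget):
        passes = rng.choice(PASS_ORDERS)
        block = rng.choice(BLOCK_SIZES)
        runtime = synthetic_runtime(variant, passes, block)
        if best is None or runtime < best[0]:
            best = (runtime, passes, block)
    return best


def hierarchical_search(generations=5, seed=0):
    """Outer level: propose a source variant, keep it if it tunes better."""
    rng = random.Random(seed)
    cache = {}  # memoized inner results, so a repeated variant is not re-tuned
    current = "naive"
    cache[current] = tune_inner(current, rng)
    for _ in range(generations):
        # Stand-in for an LLM-proposed rewrite of the kernel source.
        candidate = rng.choice(SOURCE_VARIANTS)
        if candidate not in cache:
            cache[candidate] = tune_inner(candidate, rng)
        if cache[candidate][0] < cache[current][0]:
            current = candidate
    runtime, passes, block = cache[current]
    return {
        "variant": current,
        "passes": passes,
        "block_size": block,
        "runtime": runtime,
    }


if __name__ == "__main__":
    print(hierarchical_search())
```

Calling `hierarchical_search(seed=0)` returns the best variant found together with its tuned pass order and block size. In a real system each call to the cost function would be a compile-and-run step, which is why the abstract stresses making candidate evaluation fast.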
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
optimization space
high-performance computing
AI workloads
compiler optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical optimization
evolutionary search
LLM-driven optimization
record-replay compilation
GPU kernel tuning