A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

📅 2026-01-19
📈 Citations: 1
Influential: 0
🤖 AI Summary
GPU code optimization remains a critical bottleneck in high-performance computing and large-model training and inference, as existing approaches struggle to consistently approach hardware performance limits. This work proposes a two-stage GPU kernel tuner: it first transforms the original kernel into a parameterized template through semantic restructuring, then optimizes the template parameters using a performance-feedback-driven constrained search strategy. By integrating an LLM-agent-guided iterative workflow with a synergistic mechanism of templated rewriting and search-based tuning, the method significantly enhances optimization stability and interpretability while reducing manual intervention. Evaluated on real-world CUDA kernels, it achieves up to 3× speedup over baseline implementations, outperforms pure LLM-based rewriting approaches, and demonstrates strong potential for extension to other backends such as OpenCL and HIP.
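As a rough illustration of the first stage, the "parameterized template" idea from the summary can be sketched as lifting a kernel's hard-coded constants into named tunable parameters that a later search stage can vary. The sketch below is hypothetical: the `vec_add` kernel and the parameter names `BLOCK_SIZE` and `ITEMS_PER_THREAD` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: a CUDA kernel held as a template whose tuning knobs
# (block size, per-thread work) are explicit named parameters.
from string import Template

KERNEL_TEMPLATE = Template("""
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = (blockIdx.x * ${BLOCK_SIZE} + threadIdx.x) * ${ITEMS_PER_THREAD};
    #pragma unroll
    for (int k = 0; k < ${ITEMS_PER_THREAD}; ++k)
        if (i + k < n) c[i + k] = a[i + k] + b[i + k];
}
""")

def instantiate(params):
    """Render one concrete kernel variant from a parameter assignment."""
    return KERNEL_TEMPLATE.substitute(
        BLOCK_SIZE=params["BLOCK_SIZE"],
        ITEMS_PER_THREAD=params["ITEMS_PER_THREAD"],
    )

# One point in the search space becomes one compilable kernel variant.
src = instantiate({"BLOCK_SIZE": 256, "ITEMS_PER_THREAD": 4})
```

Each parameter assignment yields a distinct compilable variant, which is what makes the second stage (search over parameters) well defined.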

📝 Abstract
GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels partially alleviate the problem, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent work applies LLM agents to kernel generation and optimization, yet many approaches focus on direct code rewriting, where parameter choices are implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and the template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute a constrained parameter search under hardware resource limits. Experiments on these real-world kernels demonstrate speedups exceeding 3× in the best case. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, makes the process more interpretable, and enables a more systematic path toward high-performance configurations. The method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for production workloads.
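The abstract's "constrained parameter search under hardware resource limits" can be sketched as pruning the configuration space against a resource budget before measuring any candidate. Everything below is an illustrative assumption rather than the paper's implementation: the limit value is the common CUDA 1024-threads-per-block ceiling, and the synthetic `measure` function stands in for real profiling feedback.

```python
# Hedged sketch of the second stage: enumerate the template parameter space,
# discard configurations that violate a hardware resource constraint, then
# pick the best remaining candidate by measured cost.
import itertools

MAX_THREADS_PER_BLOCK = 1024  # typical CUDA per-block thread limit

def feasible(cfg):
    # Constraint check: prune configs exceeding the resource budget
    # before spending any benchmarking time on them.
    return cfg["BLOCK_SIZE"] <= MAX_THREADS_PER_BLOCK

def measure(cfg):
    # Stand-in for compiling and profiling a kernel variant; returns a
    # synthetic "runtime" so the sketch is self-contained.
    return abs(cfg["BLOCK_SIZE"] - 256) + 64 / cfg["ITEMS_PER_THREAD"]

def tune(space):
    candidates = [dict(zip(space, vals))
                  for vals in itertools.product(*space.values())]
    candidates = [c for c in candidates if feasible(c)]
    return min(candidates, key=measure)

space = {"BLOCK_SIZE": [128, 256, 512, 2048], "ITEMS_PER_THREAD": [1, 2, 4]}
best = tune(space)  # BLOCK_SIZE=2048 is pruned; 256/4 wins this cost model
```

In the paper's design the feedback comes from profiling real kernel runs, and the agent uses it to plan the next round of templating and search rather than exhausting the full product space.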
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
performance bottleneck
parameter tuning
LLM-agent-based optimization
code refactoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic refactoring
search-based autotuning
template-based kernel optimization
GPU kernel tuning
agent-driven optimization
Qiuyi Qu
Nankai University
Yicheng Sui
Nankai University
Yufei Sun
Nankai University
Rui Chen
Nankai University
Xiaofei Zhang
University of Memphis
Database Systems, Graph Algorithms & Practices, Distributed & Parallel Computing
Yuzhi Zhang
Nankai University
Haofeng Wang
Nankai University
Ge Lan
Nankai University
Ning Zhang
Beijing Institute of Computer Technology and Applications