Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

📅 2026-05-27
🤖 AI Summary
This work addresses a limitation of large language models (LLMs) in GPU kernel generation: they often know *what* optimizations to try but not *when* those optimizations are sound. The authors propose KLineage, which derives reusable optimization skills from expert-written kernels by walking each implementation backward through validation-gated simplifications and reversing every accepted step. Each skill records the optimization intent, where it applies in code, the conditions that made it valid, its observed effect, and the failures its assumptions avoid. A downstream LLM applies these skills to new code under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, the approach exceeds recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. A separate 22-instance held-out check serves as a sanity test against source-case memorization.
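The three-stage gate described above (compilation, correctness, performance) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: `GateResult`, `run_gate`, the three callables, and the `min_speedup` threshold are all hypothetical names introduced here, and the real checks would invoke a compiler, a reference comparison, and a profiler.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

K = TypeVar("K")  # hypothetical kernel handle (source text, path, or binary)


@dataclass(frozen=True)
class GateResult:
    passed: bool
    failed_stage: Optional[str]  # None when every stage passed


def run_gate(
    candidate: K,
    compiles: Callable[[K], bool],
    is_correct: Callable[[K], bool],
    speedup: Callable[[K], float],
    min_speedup: float = 1.0,  # assumed threshold, not taken from the paper
) -> GateResult:
    """Run the three checks in order and stop at the first failure."""
    if not compiles(candidate):
        return GateResult(False, "compile")
    if not is_correct(candidate):
        return GateResult(False, "correctness")
    # Profiling runs last because it is only meaningful for a correct kernel.
    if speedup(candidate) < min_speedup:
        return GateResult(False, "performance")
    return GateResult(True, None)
```

Ordering the checks from cheapest to most expensive is a design choice of this sketch; it lets a failed compile short-circuit before any profiling time is spent.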
📝 Abstract
LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.
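The backward walk in the abstract can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `Skill` and `Step` records, the `walk_backward` function, and the caller-supplied `simplify`, `gate`, and `runtime` callables are hypothetical and do not come from the paper. The sketch only shows the shape of the idea: try a simplification, keep it if the gate accepts it, and record the reverse of that step as a skill with its conditions and measured effect.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

K = TypeVar("K")  # hypothetical kernel representation


@dataclass(frozen=True)
class Step:
    """One candidate simplification, i.e. one optimization to undo."""
    name: str                       # optimization removed by this step
    applies_to: str                 # code location the step touches
    preconditions: Tuple[str, ...]  # conditions under which it was valid
    avoided_failures: Tuple[str, ...]


@dataclass(frozen=True)
class Skill:
    """The reverse of an accepted simplification."""
    intent: str
    applies_to: str
    preconditions: Tuple[str, ...]
    observed_effect: float          # runtime(simpler) / runtime(optimized)
    avoided_failures: Tuple[str, ...]


def walk_backward(
    expert: K,
    steps: List[Step],
    simplify: Callable[[K, Step], K],
    gate: Callable[[K], bool],
    runtime: Callable[[K], float],
) -> Tuple[K, List[Skill]]:
    """Walk from the expert kernel toward a simpler one, collecting skills."""
    current = expert
    skills: List[Skill] = []
    for step in steps:
        simpler = simplify(current, step)
        if not gate(simpler):
            continue  # rejected simplifications produce no skill
        skills.append(
            Skill(
                intent=step.name,
                applies_to=step.applies_to,
                preconditions=step.preconditions,
                observed_effect=runtime(simpler) / runtime(current),
                avoided_failures=step.avoided_failures,
            )
        )
        current = simpler
    return current, skills
```

Read forward, the returned skills form a path from the simplest accepted kernel back to the expert one, which is one way to interpret the "optimization curriculum" the abstract mentions.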
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
LLM-based code generation
optimization applicability
program synthesis
verified transformations
Innovation

Methods, ideas, or system contributions that make the work stand out.

KLineage
GPU kernel optimization
optimization skills
validation-gated simplification
LLM-based code generation
Shuoming Zhang
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Qiuchu Yu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Yangyu Zhang
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Ruiyuan Xu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Xiyu Shi
Institute for Digital Technologies, Loughborough University London
Speech signal processing, mobile and wireless communication, network security, Internet of things
Guangli Li
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of New South Wales
Xiaobing Feng
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Programming Model, Programming Analysis and Optimization
Huimin Cui
SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Jiacheng Zhao
Institute of Computing Technology, Chinese Academy of Sciences
Parallel Computing, Parallel Compiling, Computer Architecture, Programming Model, Datacenter