Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based agents for GPU kernel optimization suffer from low search efficiency: the design space is vast, and agents often converge to suboptimal solutions. This work proposes an efficient optimization framework built on two components. First, μCUTLASS, a compact domain-specific language, enables high-level reasoning while preserving the essential optimization degrees of freedom. Second, a Speed-of-Light (SOL) guidance mechanism uses first-principles performance ceilings to steer the search and dynamically allocate budgets, avoiding diminishing returns and flagging benchmark gaming. On KernelBench, the approach achieves a 1.56× geomean speedup over PyTorch and lets weaker models outperform stronger baseline agents at lower token cost; SOL-guided budgeting saves 19–43% of tokens while retaining at least 95% of the speedup, with the best policy reaching a 1.68× efficiency gain.
📝 Abstract
Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in μCUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, μCUTLASS + SOL guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.
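The Speed-of-Light bound described in the abstract is a first-principles (roofline-style) limit: a kernel can run no faster than the slower of its compute-bound and memory-bound ceilings, and the ratio of measured time to that floor estimates remaining headroom. The sketch below illustrates the idea only; the paper's actual estimator is not public, and the function names and the hardware peak numbers (loosely H100-class FP16 throughput and HBM bandwidth) are assumptions for illustration.

```python
def sol_time_s(flops, bytes_moved, peak_flops=989e12, peak_bw=3.35e12):
    """Speed-of-Light time: the larger of the two first-principles limits
    (compute-bound time vs. memory-bound time). Peak numbers are assumed."""
    return max(flops / peak_flops, bytes_moved / peak_bw)


def headroom(measured_s, flops, bytes_moved, **peaks):
    """Measured time over the SOL floor. ~1.0 means the kernel is near the
    ceiling (deprioritize further search); large values mean room remains."""
    return measured_s / sol_time_s(flops, bytes_moved, **peaks)


# Example: a 4096^3 FP16 GEMM with ideal data reuse.
M = N = K = 4096
flops = 2 * M * N * K                      # multiply-add counted as 2 FLOPs
bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 inputs + output, read/written once

floor = sol_time_s(flops, bytes_moved)     # best-case runtime in seconds
gap = headroom(300e-6, flops, bytes_moved) # e.g. a 300 µs measured kernel
```

A guidance loop could use `gap` both to rank candidate optimizations and to budget trials: problems whose best kernel sits close to the SOL floor get fewer further iterations, and a kernel reported *faster* than the floor is a red flag for benchmark gaming.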
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
LLM agents
design space search
diminishing returns
optimization efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-Specific Language
Speed-of-Light Guidance
GPU Kernel Optimization
LLM Agents
Performance Bounds