🤖 AI Summary
This work addresses a key limitation of current large language models in generating and optimizing CUDA kernels: lacking effective training mechanisms and feedback loops, they struggle to surpass specialized compilers such as torch.compile. To overcome this, the authors propose an end-to-end optimization system that integrates large-scale agent-based reinforcement learning with scalable synthetic data generation, a skill-augmented CUDA programming environment, automated correctness verification, performance profiling, and a stable, efficient RL training framework. By moving beyond training-free or fixed-feedback paradigms, the approach achieves state-of-the-art results on KernelBench: it reaches faster-than-torch.compile rates of 100%, 100%, and 92% on Levels 1, 2, and 3, respectively, and exceeds Claude Opus 4.5 and Gemini 3 Pro by roughly 40% on Level 3.
📝 Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops; both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster-than-torch.compile rates of 100%, 100%, and 92% on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
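The abstract describes automated correctness verification and performance profiling as the source of reward signals for RL training. A minimal sketch of such a reward function is shown below; the gating structure and the clipped-speedup shaping are assumptions for illustration, not the paper's actual formula, and `kernel_reward` is a hypothetical name.

```python
def kernel_reward(is_correct: bool, kernel_ms: float, baseline_ms: float,
                  max_reward: float = 10.0) -> float:
    """Hypothetical reward: correctness verification acts as a hard gate,
    and profiling supplies a speedup-based performance term.

    Kernels that fail verification (wrong outputs, compile/runtime errors)
    receive zero reward regardless of speed, so the policy cannot trade
    correctness for performance."""
    if not is_correct or kernel_ms <= 0:
        return 0.0
    # Speedup over the baseline (e.g. torch.compile); >1.0 means faster.
    speedup = baseline_ms / kernel_ms
    # Clip to keep the reward scale bounded for stable RL updates.
    return min(speedup, max_reward)

# A verified kernel running 2x faster than the baseline:
print(kernel_reward(True, kernel_ms=0.5, baseline_ms=1.0))   # 2.0
# A fast but incorrect kernel is rejected outright:
print(kernel_reward(False, kernel_ms=0.1, baseline_ms=1.0))  # 0.0
```

Gating performance on verified correctness is a common design choice in execution-feedback settings, since otherwise a policy can exploit the reward by emitting fast but wrong kernels.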