CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current large language models in generating and optimizing CUDA kernels: lacking effective training mechanisms and feedback loops, they struggle to surpass specialized compilers such as torch.compile. To overcome this, the authors propose the first end-to-end optimization system that integrates large-scale agent-based reinforcement learning with scalable synthetic data generation, a skill-augmented CUDA programming environment, automated correctness verification, performance profiling, and a stable, efficient RL training framework. By moving beyond training-free or fixed-feedback paradigms, the approach achieves state-of-the-art results on KernelBench: it surpasses torch.compile at rates of 100%, 100%, and 92% on Levels 1, 2, and 3, respectively, and exceeds Claude Opus 4.5 and Gemini 3 Pro by approximately 40% on Level 3.

📝 Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster rates of 100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
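The abstract describes automated correctness verification and performance profiling providing reliable reward signals for RL training. As a minimal sketch of how such an execution-feedback reward might be composed (our own illustration under assumed design choices, not the paper's implementation — the gate values, penalties, and clipping threshold are hypothetical):

```python
def kernel_reward(compiled: bool, outputs_match: bool,
                  kernel_ms: float, reference_ms: float) -> float:
    """Scalar reward for one generated CUDA kernel.

    compiled      -- did the kernel build without errors?
    outputs_match -- did it pass numerical correctness checks
                     against the reference implementation?
    kernel_ms     -- profiled latency of the generated kernel
    reference_ms  -- latency of the baseline (e.g. torch.compile)
    """
    if not compiled:
        return -1.0          # hard failure: penalize non-compiling code
    if not outputs_match:
        return -0.5          # compiles but incorrect: smaller penalty
    # Correct kernel: reward grows with measured speedup over the baseline.
    speedup = reference_ms / kernel_ms
    return min(speedup, 4.0)  # clip large speedups to keep training stable
```

The correctness gate matters because an unconstrained speed reward would let the policy exploit fast-but-wrong kernels; clipping the speedup term is one common way to bound reward variance in RL training.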
Problem

Research questions and friction points this paper is trying to address.

CUDA kernel generation
large language models
GPU optimization
reinforcement learning
code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CUDA kernel generation
agentic reinforcement learning
automated code optimization
GPU performance
large language models
Weinan Dai
Tsinghua University
Artificial Intelligence, Large Language Models, Reinforcement Learning
Hanlin Wu
Tsinghua University
Generative Models, AI for Science
Qiying Yu
Tsinghua University
Multimodal Learning, Self-supervised Learning, Large Models
Huan-ang Gao
Ph.D. student, Tsinghua University
Agent, Vision & Robotics
Jiahao Li
ByteDance Seed
Chengquan Jiang
ByteDance Seed
Weiqiang Lou
ByteDance Seed
Yufan Song
Carnegie Mellon University
AI Agents, ML System
Hongli Yu
ByteDance Seed; Institute for AI Industry Research (AIR), Tsinghua University; SIA-Lab of Tsinghua AIR and ByteDance Seed
Jiaze Chen
Bytedance
Natural Language Processing
Wei-Ying Ma
Tsinghua University
Generative AI and Large Language Models (LLMs) for Science
Ya-Qin Zhang
Institute for AI Industry Research (AIR), Tsinghua University; SIA-Lab of Tsinghua AIR and ByteDance Seed
Jingjing Liu
Institute for AI Industry Research (AIR), Tsinghua University; SIA-Lab of Tsinghua AIR and ByteDance Seed
Mingxuan Wang
ByteDance LLM Team
LLM
Xin Liu
Bytedance MLSys
MLSys
Hao Zhou
Bytedance
Computer Vision, Multimodal AI, Video Understanding, Sign Language Processing