Kevin: Multi-Turn RL for Generating CUDA Kernels

📅 2025-07-16
🤖 AI Summary
To address the inefficiency of manual CUDA kernel development and the limited correctness and performance of automated generation methods, this paper introduces multi-turn reinforcement learning (RL) for end-to-end CUDA kernel generation and iterative optimization, the first such application. The authors propose an RL framework that supports long-trajectory training and cross-turn reward attribution, built on the QwQ-32B foundation model. The policy is trained with execution feedback and supports two inference modes: serial refinement and parallel sampling. Experiments show that kernel correctness improves from 56% to 82% and mean speedup over the PyTorch Eager baseline rises from 0.53× to 1.10×, significantly outperforming baselines including o4-mini. Serial refinement also scales better than parallel sampling at test time. This work establishes a transferable multi-turn optimization paradigm for AI-driven systems-level code generation.

📝 Abstract
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
Problem

Research questions and friction points this paper is trying to address.

Improving GPU kernel generation efficiency using RL
Addressing iterative optimization challenges in CUDA kernels
Improving both correctness and runtime speedup of generated kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn RL for iterative CUDA kernel optimization
Learning from long trajectories with reward attribution
Scaling serial refinement boosts performance gains
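The two ideas above, serial refinement driven by execution feedback and reward attribution across turns, can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's actual training code: the function names (`serial_refinement`, `discounted_turn_rewards`), the discount factor, and the feedback format are all assumptions made for illustration.

```python
def discounted_turn_rewards(turn_scores, gamma=0.4):
    """Cross-turn reward attribution (hypothetical scheme): each turn is
    credited with its own score plus discounted scores of later turns,
    so early attempts that enable later improvements still get signal."""
    rewards = []
    future = 0.0
    for score in reversed(turn_scores):
        future = score + gamma * future
        rewards.append(future)
    return list(reversed(rewards))

def serial_refinement(generate, evaluate, task, max_turns=4):
    """Serial refinement loop: propose a kernel, execute it, and feed the
    correctness/speedup feedback into the next generation attempt."""
    feedback = None
    trajectory = []
    for _ in range(max_turns):
        kernel = generate(task, feedback)       # policy proposes a CUDA kernel
        correct, speedup = evaluate(kernel)     # compile + run vs. reference
        trajectory.append((kernel, correct, speedup))
        feedback = {"correct": correct, "speedup": speedup}
    return trajectory
```

With this shape, parallel sampling would instead call `generate` N times with `feedback=None` and keep the best result; the paper's finding is that spending the same budget on more refinement turns scales better.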
Carlo Baronio
Stanford University, Cognition AI
Pietro Marsella
Stanford University, Cognition AI
Ben Pan
Stanford University, Cognition AI
Simon Guo
Stanford University
Computer Systems · Machine Learning
Silas Alberti
PhD Student in AI, Stanford University
Artificial Intelligence · Machine Learning · Statistical Learning Theory