🤖 AI Summary
To address the low efficiency of manual CUDA kernel development and the limited correctness and performance of automated generation methods, this paper introduces multi-turn reinforcement learning (RL) for end-to-end CUDA kernel generation and iterative optimization, the first such application. The authors propose an RL recipe, built on the QwQ-32B foundation model, that supports long-trajectory training and cross-turn reward attribution. The policy model is trained with execution feedback and supports two inference modes: serial refinement and parallel sampling. Experiments show that kernel correctness improves from 56% to 82% and average speedup over the PyTorch Eager baseline rises from 0.53× to 1.10×, significantly outperforming baselines including o4-mini. Serial refinement also scales better with additional test-time compute than parallel sampling. This work establishes a transferable multi-turn optimization paradigm for AI-driven systems-level code generation.
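To make the multi-turn setup concrete, below is a minimal sketch of a serial-refinement trajectory with cross-turn reward attribution. All names (`policy.generate`, `evaluator.run`) and the discounted-credit scheme are illustrative assumptions, not the paper's exact recipe; they only show how per-turn execution feedback can be folded back into earlier turns.

```python
# Sketch: multi-turn refinement with cross-turn reward attribution.
# The policy/evaluator interfaces and the discounting scheme are assumptions.
from dataclasses import dataclass

@dataclass
class Turn:
    kernel_src: str      # CUDA kernel emitted at this turn
    correct: bool        # passed correctness checks against the reference
    speedup: float       # reference runtime / kernel runtime
    reward: float = 0.0  # per-turn reward from execution feedback

def run_trajectory(task, policy, evaluator, max_turns=4):
    """Serial refinement: each turn conditions on prior code and feedback."""
    history, turns = [], []
    for _ in range(max_turns):
        src = policy.generate(task, history)          # hypothetical API
        correct, speedup = evaluator.run(task, src)   # hypothetical API
        reward = speedup if correct else 0.0          # verifiable reward
        turns.append(Turn(src, correct, speedup, reward))
        history.append((src, correct, speedup))       # feedback for next turn
    return turns

def attribute_rewards(turns, gamma=0.4):
    """Credit each turn with its own reward plus discounted future rewards,
    so early turns that enable later improvements still receive credit."""
    returns, running = [], 0.0
    for t in reversed(turns):
        running = t.reward + gamma * running
        returns.append(running)
    return list(reversed(returns))
```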
📝 Abstract
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it offers verifiable rewards such as correctness and speedup, making it a natural environment in which to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we find that scaling serial refinement is more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
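The "verifiable rewards" mentioned above (correctness and speedup over PyTorch Eager) can be scored with a short harness like the sketch below. The tolerance, warm-up count, and single-tensor-output assumption are illustrative choices, not the paper's exact evaluation setup.

```python
# Sketch: score a candidate kernel (wrapped as a torch-callable op) against
# the PyTorch Eager reference for correctness and speedup. Assumes a single
# tensor output and a CUDA device; thresholds are illustrative.
import time
import torch

def score_kernel(candidate_fn, reference_fn, inputs, atol=1e-4, iters=50):
    """Return (correct, speedup) for a candidate op vs. the eager reference."""
    with torch.no_grad():
        ref_out = reference_fn(*inputs)
        cand_out = candidate_fn(*inputs)
    correct = torch.allclose(ref_out, cand_out, atol=atol)

    def bench(fn):
        # Warm up, then time on GPU with explicit synchronization.
        for _ in range(5):
            fn(*inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    speedup = bench(reference_fn) / bench(candidate_fn) if correct else 0.0
    return correct, speedup
```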