Kevin: Multi-Turn RL for Generating CUDA Kernels

📅 2025-07-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Manual CUDA kernel development is slow, and automatically generated kernels often fall short on correctness and performance. This paper applies multi-turn reinforcement learning (RL) to end-to-end CUDA kernel generation and iterative optimization, which the authors describe as the first such application. The RL recipe supports training on long trajectories and reward attribution across turns, and is built on the QwQ-32B base model. The policy is trained with execution feedback and studied under two test-time scaling modes: serial refinement and parallel sampling. In the authors' evaluation setup, kernel correctness improves from 56% to 82% and mean speedup over the PyTorch Eager baseline rises from 0.53× to 1.10×, ahead of frontier models such as o4-mini (0.78×). Scaling serial refinement proves more beneficial than scaling parallel sampling. The work offers a multi-turn optimization recipe that may transfer to other systems-level code generation tasks.

📝 Abstract
Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
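The abstract mentions "effective reward attribution across turns" but this page does not give the formula. One common way to do it is to credit each turn with a discounted sum of the scores of that turn and every later turn, so an early kernel is rewarded for enabling later improvements. The sketch below illustrates that idea only. The per-turn score (`correctness_bonus` plus speedup, zero when incorrect), the discount `gamma`, and both function names are assumptions made for illustration, not values or APIs taken from the paper.

```python
from typing import List


def turn_score(correct: bool, speedup: float, correctness_bonus: float = 0.3) -> float:
    """Hypothetical per-turn score: zero unless the kernel passes correctness checks.

    The bonus value is an illustrative assumption, not a number from the paper.
    """
    if not correct:
        return 0.0
    return correctness_bonus + speedup


def attribute_rewards(scores: List[float], gamma: float = 0.5) -> List[float]:
    """Credit turn t with scores[t] + gamma * scores[t+1] + gamma^2 * scores[t+2] + ...

    gamma is an assumed discount factor; smaller values credit early turns less
    for improvements that happen many turns later.
    """
    rewards = [0.0] * len(scores)
    running = 0.0
    # Walk the trajectory backwards so each turn reuses the discounted tail sum.
    for t in reversed(range(len(scores))):
        running = scores[t] + gamma * running
        rewards[t] = running
    return rewards


if __name__ == "__main__":
    # A three-turn trajectory: incorrect kernel, then two correct refinements.
    scores = [0.0, 1.0, 2.0]
    print(attribute_rewards(scores, gamma=0.5))  # [1.0, 2.0, 2.0]
```

In this toy trajectory the first turn scores zero on its own but still receives a reward of 1.0, because the refinements that followed it succeeded.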
Problem

Research questions and friction points this paper is trying to address.

Improving GPU kernel generation efficiency using RL
Addressing iterative optimization challenges in CUDA kernels
Improving the correctness and speedup of generated kernels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn RL for iterative CUDA kernel optimization
Learning from long trajectories with reward attribution
Scaling serial refinement yields larger gains than scaling parallel sampling
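The two test-time scaling axes named above differ only in control flow, which a short sketch can make concrete. Parallel sampling draws independent attempts from the same prompt. Serial refinement feeds each attempt's execution feedback into the next one. Everything here is hypothetical scaffolding: `Feedback`, `generate`, `evaluate`, and the toy stubs stand in for a real model and a real kernel benchmark harness, and the toy is rigged so that feedback always helps. Its output shows the mechanics and is not evidence for the paper's result.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Feedback:
    """Hypothetical record of the previous attempt, passed back to the model."""
    kernel: str
    correct: bool
    speedup: float


# Assumed interfaces: generate(task, feedback) -> kernel source,
# evaluate(kernel) -> (is_correct, speedup_over_baseline).
Generate = Callable[[str, Optional[Feedback]], str]
Evaluate = Callable[[str], Tuple[bool, float]]


def parallel_sampling(task: str, generate: Generate, evaluate: Evaluate, n_samples: int) -> float:
    """Best speedup among n independent attempts; no attempt sees feedback."""
    best = 0.0
    for _ in range(n_samples):
        correct, speedup = evaluate(generate(task, None))
        if correct:
            best = max(best, speedup)
    return best


def serial_refinement(task: str, generate: Generate, evaluate: Evaluate, n_turns: int) -> float:
    """Best speedup over n turns; each turn sees the previous turn's result."""
    best = 0.0
    feedback: Optional[Feedback] = None
    for _ in range(n_turns):
        kernel = generate(task, feedback)
        correct, speedup = evaluate(kernel)
        if correct:
            best = max(best, speedup)
        feedback = Feedback(kernel, correct, speedup)
    return best


def toy_generate(task: str, feedback: Optional[Feedback]) -> str:
    """Toy stub: version 0 without feedback, otherwise previous version + 1."""
    if feedback is None:
        return "v0"
    return "v" + str(int(feedback.kernel[1:]) + 1)


def toy_evaluate(kernel: str) -> Tuple[bool, float]:
    """Toy stub: version 0 is incorrect; later versions get steadily faster."""
    version = int(kernel[1:])
    return version >= 1, 0.5 * version


if __name__ == "__main__":
    print(parallel_sampling("matmul", toy_generate, toy_evaluate, 4))  # 0.0
    print(serial_refinement("matmul", toy_generate, toy_evaluate, 4))  # 1.5
```

With the same budget of four model calls, the toy parallel run never gets past the incorrect first draft, while the serial run reaches version 3. A real comparison would replace the stubs with sampled model outputs and measured kernel timings.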