🤖 AI Summary
This work addresses reward hacking and lazy optimization, two failure modes that hinder large language models in generating correct, high-performance Triton kernels. The authors propose Dr.Kernel-14B, a 14-billion-parameter model trained with multi-turn reinforcement learning in KernelGYM, a distributed GPU environment. The approach introduces the TRLOO algorithm to remove the policy gradient bias caused by self-inclusion in GRPO, applies mismatch correction for training stability, and uses profiling-based rewards and profiling-based rejection sampling to mitigate lazy optimization. On the KernelBench Level-2 subset, 31.6% of the kernels generated by Dr.Kernel-14B achieve at least a 1.2× speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%); this rate rises to 47.8% when the best candidate across all turns is selected.
📝 Abstract
High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may exploit the training reward or prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
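To make the self-inclusion issue concrete, here is a minimal sketch of a leave-one-out baseline next to a group-mean baseline. It is an illustration, not the paper's implementation: the function names are hypothetical, the group-mean version omits the standard-deviation normalization that GRPO normally applies, and the per-turn grouping in `turn_level_loo` is an assumption about how a turn-level variant might be organized. With N samples, the mean-baseline advantage equals the leave-one-out advantage scaled by (N-1)/N, because each sample's own reward leaks into its baseline.

```python
from typing import List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Advantage with a group-mean baseline (each sample is part of its own baseline).

    Simplified: the std normalization used in GRPO is intentionally left out.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage where the baseline for sample i is the mean of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def turn_level_loo(rewards_by_turn: List[List[float]]) -> List[List[float]]:
    """Hypothetical turn-level variant: apply leave-one-out within each turn's group.

    Assumes rewards_by_turn[t] holds the rewards of all samples that reached turn t.
    """
    return [leave_one_out_advantages(group) for group in rewards_by_turn]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 0.0]
    print(mean_baseline_advantages(group))
    print(leave_one_out_advantages(group))
```

In this toy group of four, the single successful sample gets an advantage of 0.75 under the mean baseline and 1.0 under leave-one-out, which matches the (N-1)/N = 3/4 shrinkage.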