🤖 AI Summary
This work addresses the high manual tuning cost of half-precision general matrix multiplication (HGEMM) CUDA kernels and the difficulty of surpassing closed-source libraries such as cuBLAS and cuBLASLt. The authors propose the first automated optimization framework integrating large language models (LLMs) with reinforcement learning (RL): the LLM generates, rewrites, and constrains CUDA kernel code, while RL, using measured execution speed as the reward, searches a space of 1,000 matrix configurations for the fastest kernels. The key innovation is embedding the LLM as a structured prior in the RL policy, overcoming the human-expertise bottleneck. Experiments show that, in offline mode, the method achieves average speedups of 22.0% over torch.matmul, 19.2% over cuBLAS, and 11.4% over cuBLASLt's auto-tuning; in server mode, the average speedup over torch.matmul rises to 28.7%, advancing the state of the art in HGEMM performance.
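The speed-as-reward loop can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `time_kernel` and `speedup_reward` are hypothetical names, a CPU callable stands in for a compiled CUDA kernel, and the real system would time GPU execution (e.g., with CUDA events) rather than wall clock.

```python
import statistics
import time

def time_kernel(fn, iters=50):
    """Median wall-clock time of a callable (CPU stand-in for a CUDA kernel)."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def speedup_reward(candidate, baseline):
    """RL reward: relative speedup of a candidate kernel over the baseline.

    A value of 0.22 corresponds to a +22% speedup; negative values mean
    the candidate is slower than the baseline.
    """
    t_base = time_kernel(baseline)
    t_cand = time_kernel(candidate)
    return t_base / t_cand - 1.0
```

In this framing, the RL policy proposes kernel variants (here, arbitrary callables) and is rewarded directly by measured speedup, so the search optimizes exactly the deployment metric.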
📝 Abstract
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used *torch.matmul* to NVIDIA's state-of-the-art closed-source libraries, i.e., *cuBLAS* and *cuBLASLt*. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over *torch.matmul* on average; +19.2% over *cuBLAS* using the better of the normal-normal (NN) and transposed-normal (TN) layout configurations; +16.8% over *cuBLASLt-heuristic*, which queries the *cuBLASLt* library and selects an algorithm based on its heuristic's suggestion; and +11.4% over the most competitive baseline, *cuBLASLt-AutoTuning*, which selects the fastest algorithm from up to 100 candidates suggested by *cuBLASLt*. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over *torch.matmul*, *cuBLAS*, *cuBLASLt-heuristic*, and *cuBLASLt-AutoTuning*, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
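The two evaluation modes described above can be sketched as follows. This is a hedged illustration of the protocols only: `bench_offline` and `bench_server` are hypothetical names, `kernel` is a stand-in callable, and the actual benchmarks time CUDA kernels on the GPU rather than Python wall clock.

```python
import random
import statistics
import time

def bench_offline(kernel, iters=100):
    """Offline mode: kernels run back-to-back with no gaps between launches.

    Returns the average time per launch over the whole batch.
    """
    t0 = time.perf_counter()
    for _ in range(iters):
        kernel()
    return (time.perf_counter() - t0) / iters

def bench_server(kernel, iters=100, max_gap_s=0.001):
    """Server mode: random idle gaps between launches simulate the request
    arrival pattern of real-time inference; only the kernel itself is timed.
    """
    samples = []
    for _ in range(iters):
        time.sleep(random.uniform(0.0, max_gap_s))  # simulated arrival jitter
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

The distinction matters because back-to-back launches keep caches and clocks warm, whereas random gaps expose cold-start effects; a kernel that wins under one protocol may rank differently under the other, which is consistent with the larger speedups reported in server mode.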