CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the high manual-tuning cost of half-precision general matrix multiplication (HGEMM) CUDA kernels and the difficulty of surpassing closed-source libraries (e.g., cuBLAS/cuBLASLt). The authors propose CUDA-L2, an automated optimization framework integrating large language models (LLMs) with reinforcement learning (RL). The LLM generates, rewrites, and constrains CUDA kernel code, while RL, using measured execution speed as the reward, efficiently searches a space of 1,000 configurations for optimal kernels. The key idea is embedding the LLM as a structured prior in the RL policy, bypassing the human-expertise bottleneck. Experiments show that, in offline mode, the method achieves average speedups of 22.0% over torch.matmul, 19.2% over cuBLAS, 16.8% over cuBLASLt's heuristic, and 11.4% over cuBLASLt's auto-tuning; in server mode, the corresponding speedups rise to 28.7%, 26.0%, 22.4%, and 15.9%, advancing the state of the art in HGEMM performance.

📝 Abstract
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to Nvidia's state-of-the-art closed-source libraries, i.e., cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over torch.matmul on average; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm suggested by its heuristic; and +11.4% over the most competitive cuBLASLt-AutoTuning mode, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
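The offline/server distinction in the abstract can be made concrete with a small timing sketch. This is a hypothetical pure-Python illustration, not the paper's harness: `kernel` is a placeholder callable standing in for an HGEMM launch, and the random gaps in `bench_server` mimic the real-time inference traffic described above.

```python
import random
import time

def bench_offline(kernel, iters=50):
    """Offline mode: launch kernels back-to-back with no idle gaps."""
    start = time.perf_counter()
    for _ in range(iters):
        kernel()
    return (time.perf_counter() - start) / iters

def bench_server(kernel, iters=50, max_gap_s=0.001):
    """Server mode: insert random idle gaps between launches to simulate
    real-time inference; only the kernel time itself is accumulated."""
    total = 0.0
    for _ in range(iters):
        time.sleep(random.uniform(0.0, max_gap_s))  # simulated request gap
        t0 = time.perf_counter()
        kernel()
        total += time.perf_counter() - t0
    return total / iters

def speedup_pct(baseline_s, candidate_s):
    """Percent improvement of candidate over baseline, as reported above
    (e.g. a 1.22x faster kernel is +22.0%)."""
    return (baseline_s / candidate_s - 1.0) * 100.0
```

Under this convention, a candidate whose mean latency is 1/1.22 of the baseline's reports +22.0%, matching the offline torch.matmul number.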
Problem

Research questions and friction points this paper is trying to address.

Manual tuning of half-precision matrix multiplication (HGEMM) CUDA kernels is costly and bound to expert knowledge
Nvidia's closed-source libraries (cuBLAS, cuBLASLt) and torch.matmul are difficult baselines to surpass
Can reinforcement learning guided by large language models automate kernel optimization at scale?
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-guided reinforcement learning optimizes CUDA kernels automatically
Uses measured execution speed as the RL reward to explore 1,000 configurations
Outperforms cuBLAS and torch.matmul in offline and server modes
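The reward-driven search above can be sketched as follows. This is a minimal illustration under stated assumptions: `measure_latency` is a hypothetical stand-in for compiling and timing an LLM-generated kernel variant (the real reward is measured CUDA execution speed), and the greedy loop abstracts away the actual RL policy.

```python
import random

def measure_latency(config, rng):
    # Hypothetical stand-in: in CUDA-L2 this would compile and benchmark
    # an LLM-generated HGEMM kernel variant on the GPU.
    return rng.uniform(0.8, 1.2)

def search(configs, baseline_latency, seed=0):
    """Greedy sketch of the objective: reward = speedup over the baseline;
    keep the fastest kernel found across the configuration space."""
    rng = random.Random(seed)  # deterministic for this sketch
    best_cfg, best_reward = None, float("-inf")
    for cfg in configs:
        reward = baseline_latency / measure_latency(cfg, rng) - 1.0
        if reward > best_reward:
            best_cfg, best_reward = cfg, reward
    return best_cfg, best_reward

# 1,000 illustrative tile-shape configurations, mirroring the search scale
configs = [(m, n) for m in range(32, 352, 32) for n in range(32, 3232, 32)]
best, reward = search(configs, baseline_latency=1.0)
```

In the actual system the LLM proposes and rewrites the candidate kernels, acting as a structured prior so that the search explores a configuration space impractical for human tuning.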
Songqiao Su
DeepReinforce Team
Xiaofei Sun
Stony Brook University, Zhejiang University
Xiaoya Li
University of Washington
Albert Wang
DeepReinforce Team
Jiwei Li
DeepReinforce Team
Chris Shum
DeepReinforce Team