🤖 AI Summary
This work addresses the high manual tuning cost of half-precision general matrix multiplication (HGEMM) CUDA kernels and the difficulty of surpassing closed-source libraries such as cuBLAS and cuBLASLt. The authors propose the first automated optimization framework integrating large language models (LLMs) with reinforcement learning (RL): the LLM generates, rewrites, and constrains CUDA kernel code, while RL, using measured execution speed as the reward, searches a space of 1,000 matrix configurations for the fastest kernels. The key innovation is embedding the LLM as a structured prior in the RL policy, overcoming the human-expertise bottleneck. Experiments show that, in offline mode, the method achieves average speedups of 22.0% over torch.matmul, 19.2% over cuBLAS, and 11.4% over cuBLASLt's auto-tuning; in server mode, the average speedup over torch.matmul rises to 28.7%, advancing the state of the art in HGEMM performance.
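The speed-as-reward loop can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `time_kernel` and `speedup_reward` are hypothetical names, a CPU callable stands in for a compiled CUDA kernel, and the real system would time GPU execution (e.g., with CUDA events) rather than wall clock.

```python
import statistics
import time

def time_kernel(fn, iters=50):
    """Median wall-clock time of a callable (CPU stand-in for a CUDA kernel)."""
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

def speedup_reward(candidate, baseline):
    """RL reward: relative speedup of a candidate kernel over the baseline.

    A value of 0.22 corresponds to a +22% speedup; negative values mean
    the candidate is slower than the baseline.
    """
    t_base = time_kernel(baseline)
    t_cand = time_kernel(candidate)
    return t_base / t_cand - 1.0
```

In this framing, the RL policy proposes kernel variants (here, arbitrary callables) and is rewarded directly by measured speedup, so the search optimizes exactly the deployment metric.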
📝 Abstract
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used *torch.matmul* to NVIDIA's state-of-the-art closed-source libraries, i.e., *cuBLAS* and *cuBLASLt*. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0% over *torch.matmul* on average; +19.2% over *cuBLAS* using the better of the normal-normal (NN) and transposed-normal (TN) layout configurations; +16.8% over *cuBLASLt-heuristic*, which queries the *cuBLASLt* library and selects an algorithm based on its heuristic's suggestion; and +11.4% over the most competitive baseline, *cuBLASLt-AutoTuning*, which selects the fastest algorithm from up to 100 candidates suggested by *cuBLASLt*. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups further increase to +28.7%, +26.0%, +22.4%, and +15.9% over *torch.matmul*, *cuBLAS*, *cuBLASLt-heuristic*, and *cuBLASLt-AutoTuning*, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2
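The two evaluation modes described above can be sketched as follows. This is a hedged illustration of the protocols only: `bench_offline` and `bench_server` are hypothetical names, `kernel` is a stand-in callable, and the actual benchmarks time CUDA kernels on the GPU rather than Python wall clock.

```python
import random
import statistics
import time

def bench_offline(kernel, iters=100):
    """Offline mode: kernels run back-to-back with no gaps between launches.

    Returns the average time per launch over the whole batch.
    """
    t0 = time.perf_counter()
    for _ in range(iters):
        kernel()
    return (time.perf_counter() - t0) / iters

def bench_server(kernel, iters=100, max_gap_s=0.001):
    """Server mode: random idle gaps between launches simulate the request
    arrival pattern of real-time inference; only the kernel itself is timed.
    """
    samples = []
    for _ in range(iters):
        time.sleep(random.uniform(0.0, max_gap_s))  # simulated arrival jitter
        t0 = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

The distinction matters because back-to-back launches keep caches and clocks warm, whereas random gaps expose cold-start effects; a kernel that wins under one protocol may rank differently under the other, which is consistent with the larger speedups reported in server mode.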