CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of automatically generating high-performance CUDA kernels tailored to GPU hardware characteristics. We propose the Feature Search and Reinforcement (FSR) framework—the first LLM-based approach enabling end-to-end kernel generation while jointly guaranteeing functional correctness, compilation success, and empirically measured latency optimality. FSR integrates hardware-aware prompting, compilation-feedback-driven reinforcement learning, real GPU latency evaluation, and multi-iteration optimization. Evaluated on AI and compute-intensive operators, FSR achieves 100% functional correctness and delivers up to 179× speedup over hand-written CUDA kernels—substantially outperforming existing LLM-based code generation methods. This work establishes a novel paradigm for LLM-driven hardware–software co-design and compiler-aware programming.

📝 Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing high-performance GPU kernels that fully exploit the underlying hardware. To address this challenge, we propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness, as well as runtime performance, which are validated through extensive and diverse test cases and measured by actual kernel execution latency on the target GPU, respectively. This approach enables LLMs not only to generate syntactically and semantically correct CUDA code but also to iteratively refine it for efficiency, tailored to the characteristics of the GPU architecture. We evaluate FSR on representative CUDA kernels, covering AI workloads and computationally intensive algorithms. Our results show that LLMs augmented with FSR consistently guarantee functional correctness. Meanwhile, the automatically generated kernels can outperform general human-written code by a factor of up to 179× in execution speed. These findings highlight the potential of combining LLMs with performance reinforcement to automate GPU programming for hardware-specific, architecture-sensitive, and performance-critical applications.
Problem

Research questions and friction points this paper is trying to address.

Generate efficient CUDA kernels using LLMs
Optimize GPU code for hardware-specific performance
Ensure correctness and speed in parallel computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate and optimize CUDA kernels
FSR jointly optimizes correctness and performance
Automated kernels outperform human-written code
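The speedup claim above rests on comparing measured execution time of the generated kernel against a human-written baseline. A minimal timing harness for that comparison might look like the following; the helper names and warmup/repetition counts are assumptions for illustration, not the paper's measurement protocol.

```python
import time

def time_fn(fn, *args, warmup=3, reps=10):
    """Average wall-clock time of fn(*args) over reps runs after warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - start) / reps

def speedup(baseline_fn, candidate_fn, *args):
    """Speedup of candidate over baseline (>1 means the candidate is faster)."""
    return time_fn(baseline_fn, *args) / time_fn(candidate_fn, *args)
```

For GPU kernels, wall-clock timing would be replaced by device-side timing (e.g. CUDA events) with synchronization, but the ratio reported as "up to 179×" has the same form: baseline latency divided by generated-kernel latency.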