CUDA-LLM: LLMs Can Write Efficient CUDA Kernels

📅 2025-06-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of automatically generating high-performance CUDA kernels tailored to GPU hardware characteristics. We propose the Feature Search and Reinforcement (FSR) framework—the first LLM-based approach enabling end-to-end kernel generation while jointly guaranteeing functional correctness, compilation success, and empirically measured latency optimality. FSR integrates hardware-aware prompting, compilation-feedback-driven reinforcement learning, real GPU latency evaluation, and multi-iteration optimization. Evaluated on AI and compute-intensive operators, FSR achieves 100% functional correctness and delivers up to 179× speedup over hand-written CUDA kernels—substantially outperforming existing LLM-based code generation methods. This work establishes a novel paradigm for LLM-driven hardware–software co-design and compiler-aware programming.

📝 Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing high-performance GPU kernels that fully exploit the underlying hardware. To address this challenge, we propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness, as well as runtime performance, which are validated through extensive and diverse test cases and measured by actual kernel execution latency on the target GPU, respectively. This approach enables LLMs not only to generate syntactically and semantically correct CUDA code but also to iteratively refine it for efficiency, tailored to the characteristics of the GPU architecture. We evaluate FSR on representative CUDA kernels, covering AI workloads and computationally intensive algorithms. Our results show that LLMs augmented with FSR consistently guarantee functional correctness. Meanwhile, the automatically generated kernels can outperform general human-written code by a factor of up to 179× in execution speed. These findings highlight the potential of combining LLMs with performance reinforcement to automate GPU programming for hardware-specific, architecture-sensitive, and performance-critical applications.
Problem

Research questions and friction points this paper is trying to address.

Generate efficient CUDA kernels using LLMs
Optimize GPU code for hardware-specific performance
Ensure correctness and speed in parallel computing
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate and optimize CUDA kernels
FSR jointly optimizes correctness and performance
Automated kernels outperform human-written code
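The speedup claim above rests on comparing measured execution time of the generated kernel against a human-written baseline. A minimal timing harness for that comparison might look like the following; the helper names and warmup/repetition counts are assumptions for illustration, not the paper's measurement protocol.

```python
import time

def time_fn(fn, *args, warmup=3, reps=10):
    """Average wall-clock time of fn(*args) over reps runs after warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - start) / reps

def speedup(baseline_fn, candidate_fn, *args):
    """Speedup of candidate over baseline (>1 means the candidate is faster)."""
    return time_fn(baseline_fn, *args) / time_fn(candidate_fn, *args)
```

For GPU kernels, wall-clock timing would be replaced by device-side timing (e.g. CUDA events) with synchronization, but the ratio reported as "up to 179×" has the same form: baseline latency divided by generated-kernel latency.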