KernelBench: Can LLMs Write Efficient GPU Kernels?

📅 2025-02-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automatically generating efficient GPU kernels with large language models (LLMs). We introduce KernelBench, an open-source benchmark of 250 realistic PyTorch machine learning operators that jointly evaluates functional correctness and measured speedup. We propose a novel metric, *fast_p*, which counts a generated kernel as a success only when it is functionally correct *and* achieves a speedup greater than a threshold *p* over the PyTorch baseline. We further establish an end-to-end GPU kernel generation evaluation framework, integrating LLM inference, iterative execution feedback, profiler-driven optimization, and low-level operator verification. Experimental results show that state-of-the-art LLMs generate kernels matching baseline performance out of the box on fewer than 20% of tasks. Incorporating execution feedback significantly improves success rates; however, gains diminish sharply as the required speedup threshold *p* increases.
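The iterative execution-feedback loop described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual implementation: the function names `generate` and `compile_and_run` are assumed stand-ins for an LM call and a kernel build-and-verify harness.

```python
def iterative_refine(task, generate, compile_and_run, max_turns=5):
    """Sketch of execution-feedback refinement (hypothetical API):
    re-prompt the model with compiler/runtime feedback until the
    generated kernel passes verification or the budget is spent."""
    feedback = ""
    for _ in range(max_turns):
        kernel = generate(task, feedback)          # LM proposes a kernel
        ok, feedback = compile_and_run(task, kernel)  # build, run, verify
        if ok:
            return kernel                          # correct kernel found
    return None                                    # budget exhausted
```

In the paper's experiments, this kind of feedback loop raises success rates substantially over single-shot generation, though gains shrink as the speedup bar rises.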

📝 Abstract
Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment, and progress on the benchmark directly translates to faster practical kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in fewer than 20% of cases. While results improve when execution and profiling feedback is leveraged during iterative refinement, KernelBench remains a challenging benchmark, and its difficulty increases as we raise the speedup threshold p.
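The fast_p metric can be written down directly from its definition: the fraction of tasks whose generated kernel is both functionally correct and faster than the PyTorch baseline by more than a factor of p. A minimal sketch, with an assumed per-task result record (`correct`, `baseline_ms`, `kernel_ms`):

```python
def fast_p(results, p):
    """fast_p: fraction of tasks whose generated kernel is
    functionally correct AND achieves speedup > p, where
    speedup = baseline_time / kernel_time (hypothetical record format)."""
    hits = sum(
        1 for r in results
        if r["correct"] and r["baseline_ms"] / r["kernel_ms"] > p
    )
    return hits / len(results)
```

Note that fast_1 corresponds to "correct and strictly faster than PyTorch", while fast_0 reduces to plain functional correctness; raising p makes the benchmark strictly harder.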
Problem

Research questions and friction points this paper is trying to address.

Automate GPU kernel generation
Evaluate LMs on PyTorch workloads
Measure kernel speedup and correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs automate GPU kernel generation
KernelBench evaluates kernel efficiency
fast_p metric assesses kernel speedup