FastKernels: Benchmarking GPU Kernel Generation in Production

📅 2026-05-22

🤖 AI Summary
Existing benchmarks for GPU kernel generation are disconnected from real-world inference systems, so kernels that score well on them can introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation once integrated. To address this gap, this work proposes FastKernels, a production-oriented GPU kernel benchmark built around 46 representative architectures whose kernels cover those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels also serves as a lightweight inference framework, and each task's interface mirrors the corresponding module in the leading library for its architecture family, so optimized kernels can be deployed directly into production codebases. Experiments show that FastKernels runs at parity with established LLM serving systems such as vLLM and SGLang on mainstream architectures and substantially exceeds upstream reference implementations on under-served architectures. Evaluated on FastKernels, even the strongest LLM-based kernel generation agent reaches only a 0.94× aggregate speedup over production baselines, which is a net slowdown, while weaker agents reach 0.78× and 0.53×.
📝 Abstract
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only $0.94\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
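To make the headline numbers concrete, the short sketch below reproduces the coverage figure and reads the reported speedups as runtimes. It assumes the conventional definition of speedup as baseline time divided by agent time; the abstract does not state how the aggregate is computed, so this is only an interpretation aid. The agent labels and the `relative_runtime` helper are hypothetical names introduced for illustration.

```python
# Figures quoted in the abstract.
covered, total = 409, 425
coverage_pct = round(100 * covered / total, 1)

# Assumption: speedup = baseline_time / agent_time, so a value below 1
# means the agent's kernels are slower than the production baseline.
# The dictionary keys are illustrative labels, not names from the paper.
speedups = {"strongest": 0.94, "middle": 0.78, "weakest": 0.53}


def relative_runtime(speedup: float) -> float:
    """Agent runtime expressed as a multiple of the baseline runtime."""
    return 1.0 / speedup


print(f"coverage: {coverage_pct}%")
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x speedup -> {relative_runtime(s):.2f}x baseline runtime")
```

Under that assumption, a $0.94\times$ speedup corresponds to kernels taking roughly 6% longer than the baseline, and $0.53\times$ to nearly twice as long.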
Problem

Research questions and friction points this paper is trying to address.

GPU kernel generation
benchmark misalignment
production inference
LLM-based agents
kernel optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPU kernel generation
production-aligned benchmark
FastKernels
LLM-based agents
inference optimization