Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of unstable search, poor generalization, and limited cross-platform adaptability in automatic GPU kernel generation by introducing a unified evolutionary optimization framework. For the first time, a large language model is integrated not as a one-shot generator but as a powerful local improver within the evolutionary loop. The approach synergistically combines population-based evolutionary search, structured execution feedback—encompassing compilation success, correctness, and speedup—and post-training fine-tuning, leveraging long-term evolutionary trajectories to generate step-level supervision and reinforcement learning signals. The resulting model, Kernel-Smith-235B-RL, achieves state-of-the-art performance on KernelBench, outperforming Gemini-3.0-Pro and Claude-4.6-Opus in average speedup. Its MetaX variant, Kernel-Smith-MACA-30B, also significantly surpasses DeepSeek-V3.2-Think and has been deployed in production systems such as SGLang and LMDeploy.
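The evolutionary loop described above, a population of executable candidates, an archive of top performers guiding selection, and structured execution feedback on compilation, correctness, and speedup, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: `Candidate`, `evaluate`, and `llm_improve` are hypothetical placeholders (the real `evaluate` would call the backend-specific Triton/MACA evaluation service, and `llm_improve` would query the fine-tuned model with the parent program, its feedback, and archive exemplars).

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feedback:
    compiled: bool
    correct: bool
    speedup: float  # speedup vs. the reference kernel; 0.0 if not runnable

@dataclass
class Candidate:
    code: str
    feedback: Optional[Feedback] = None

def evaluate(code: str) -> Feedback:
    # Placeholder for the backend evaluation service.
    # Toy scoring here, just so the loop runs end to end.
    compiled = "syntax_error" not in code
    correct = compiled and "wrong" not in code
    speedup = len(code) / 10.0 if correct else 0.0
    return Feedback(compiled, correct, speedup)

def llm_improve(parent: Candidate, archive: list) -> str:
    # Placeholder for the LLM acting as a local improver: in the real
    # system it sees the parent's code, its structured feedback, and
    # archive exemplars, and proposes a revised kernel.
    return parent.code + "x"  # toy "revision"

def evolve(seed_code: str, generations: int = 5, pop_size: int = 4,
           archive_size: int = 3) -> Candidate:
    population = [Candidate(seed_code, evaluate(seed_code))]
    archive = []
    for _ in range(generations):
        # Archive retains the best-performing correct programs seen so far.
        archive = sorted(
            [c for c in population + archive if c.feedback.correct],
            key=lambda c: c.feedback.speedup, reverse=True)[:archive_size]
        parents = archive or population
        children = []
        for _ in range(pop_size):
            parent = random.choice(parents)
            code = llm_improve(parent, archive)
            children.append(Candidate(code, evaluate(code)))
        population = children
    return max(archive + population, key=lambda c: c.feedback.speedup)

best = evolve("kernel_v0")
print(best.feedback.speedup)
```

The key structural point the sketch tries to capture is that the LLM sits *inside* the loop as a mutation operator conditioned on execution feedback, rather than generating kernels in one shot.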
📝 Abstract
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-Pro and Claude-4.6-Opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-Think and Qwen3-235B-2507-Think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
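The abstract's post-training recipe turns long-horizon evolution trajectories into step-level signals by retaining correctness-preserving, high-gain revisions. A filter along those lines might look like the sketch below; the `Step` schema and the relative-gain threshold are illustrative assumptions, not the paper's exact selection criterion.

```python
from dataclasses import dataclass

@dataclass
class Step:
    parent_code: str
    child_code: str
    parent_correct: bool
    child_correct: bool
    parent_speedup: float
    child_speedup: float

def select_training_steps(trajectory, min_gain=0.05):
    """Keep only revisions that preserve correctness and improve speedup
    by at least `min_gain` (relative), so the model is supervised to act
    as a local improver rather than a one-shot generator."""
    kept = []
    for s in trajectory:
        if not (s.parent_correct and s.child_correct):
            continue  # drop steps that break, or never achieve, correctness
        gain = (s.child_speedup - s.parent_speedup) / max(s.parent_speedup, 1e-9)
        if gain >= min_gain:
            kept.append(s)  # becomes a (parent code + feedback -> child code) pair
    return kept

trajectory = [
    Step("v0", "v1", True, True, 1.00, 1.20),   # kept: +20% gain
    Step("v1", "v2", True, False, 1.20, 0.00),  # dropped: breaks correctness
    Step("v1", "v3", True, True, 1.20, 1.21),   # dropped: gain below threshold
    Step("v3", "v4", True, True, 1.21, 1.50),   # kept: +24% gain
]
print(len(select_training_steps(trajectory)))  # prints 2
```

Each retained step would then serve as a supervised example (or a positive reward signal) for the improver model, with the parent program and its execution feedback as input and the revised program as target.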
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
evolutionary search
code generation
heterogeneous platforms
performance portability
Innovation

Methods, ideas, or system contributions that make the work stand out.

evolutionary kernel optimization
GPU kernel generation
LLM-driven code optimization
backend-specific evaluation
reinforcement learning for compilation
👥 Authors
He Du, Northwestern Polytechnical University (ubiquitous computing, data mining, mobile sensing)
Qiming Ge, Shanghai AI Laboratory
Jiakai Hu, Shanghai AI Laboratory
Aijun Yang, Shanghai AI Laboratory
Zheng Cai, Shanghai AI Laboratory
Zixian Huang, Shanghai AI Laboratory (question answering, natural language processing)
Sheng Yuan, Shanghai AI Laboratory
Qinxiu Cheng, Shanghai AI Laboratory
Xinchen Xie, Shanghai AI Laboratory
Yicheng Chen, Fudan University and Shanghai AI Laboratory (computer vision)
Yining Li, Shanghai AI Laboratory (multimodal learning, large language models)
Jiaxing Xie, MetaX
Huanan Dong, MetaX
Yaguang Wu, MetaX
Xiangjun Huang, MetaX
Jian Yang, Facebook (NLP, machine learning)
Hui Wang, Shanghai AI Laboratory
Bowen Zhou, Shanghai AI Laboratory
Bowen Li, Shanghai AI Laboratory (natural language processing)
Qipeng Guo, Fudan University
Kai Chen, Shanghai AI Laboratory (LLM, VLM, computer vision)