🤖 AI Summary
This work addresses the challenges of unstable search, poor generalization, and limited cross-platform adaptability in automatic GPU kernel generation by introducing a unified evolutionary optimization framework. Rather than serving as a one-shot generator, the large language model is integrated as a local improver inside the evolutionary loop. The approach combines population-based evolutionary search, structured execution feedback covering compilation success, correctness, and speedup, and post-training fine-tuning that converts long-horizon evolutionary trajectories into step-level supervision and reinforcement learning signals. The resulting model, Kernel-Smith-235B-RL, achieves state-of-the-art performance on KernelBench, outperforming Gemini-3.0-Pro and Claude-4.6-Opus in average speedup. Its MetaX variant, Kernel-Smith-MACA-30B, significantly surpasses DeepSeek-V3.2-Think, and the same workflow has yielded upstream contributions to production systems such as SGLang and LMDeploy.
📝 Abstract
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-Pro and Claude-4.6-Opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-Think and Qwen3-235B-2507-Think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
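To make the evaluation-driven loop concrete, the following is a minimal sketch of the agent-side search the abstract describes: a population seeded with an executable candidate, an archive of top performers, structured feedback on compilation/correctness/speedup, and an LLM acting as a local improver. All names here (`evaluate`, `llm_revise`, `Feedback`) and the scoring logic are illustrative stand-ins, not the paper's actual implementation or APIs.

```python
import random
from dataclasses import dataclass

@dataclass
class Feedback:
    compiled: bool
    correct: bool
    speedup: float  # measured against a reference implementation

def evaluate(candidate: str) -> Feedback:
    # Stub evaluator: a real backend-specific service would compile
    # the kernel, check numerical correctness, and time it.
    score = sum(ord(c) for c in candidate) % 100 / 50.0
    return Feedback(compiled=True, correct=True, speedup=score)

def llm_revise(candidate: str, fb: Feedback) -> str:
    # Stub for the LLM as a local improver: in the real framework the
    # model sees the program plus structured execution feedback and
    # proposes a targeted revision rather than a fresh program.
    return candidate + random.choice("xyz")

def evolve(seed: str, generations: int = 10, archive_size: int = 4):
    archive = [(evaluate(seed).speedup, seed)]
    for _ in range(generations):
        # Sample a parent from the archive of top/diverse programs.
        _, parent = random.choice(archive)
        child = llm_revise(parent, evaluate(parent))
        fb = evaluate(child)
        # Retain only correctness-preserving candidates, keeping the
        # archive sorted by speedup (the high-gain revisions are also
        # what the post-training recipe would harvest as supervision).
        if fb.compiled and fb.correct:
            archive.append((fb.speedup, child))
            archive.sort(key=lambda t: -t[0])
            archive = archive[:archive_size]
    return archive

best_speedup, best_kernel = evolve("kernel_v0")[0]
```

Because the seed is placed in the archive and only lower-ranked entries are evicted, the best archived speedup never regresses below the seed's, mirroring the stability the framework aims for.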