KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Deep Learning Recommendation Models (DLRMs) face severe kernel development bottlenecks due to triple heterogeneity: diverse model architectures, operator primitives, and hardware platforms (NVIDIA and AMD GPUs, plus Meta's custom accelerators). Method: the paper introduces the first proxy-based kernel programming framework that integrates graph search with retrieval-augmented generation (RAG) for prompt synthesis, enabling co-optimization across multiple abstraction languages and dynamic adaptation to runtime context. Built on domain-specific languages (DSLs) such as Triton and CuTe, it supports hardware-agnostic operator modeling and fitness-driven automatic kernel generation. Contribution/Results: evaluated on KernelBench (250 tests), the framework achieves a 100% pass rate, correctly implements all 160 PyTorch ATen operators across platforms with 100% functional correctness, reduces kernel development time from weeks to hours, and consistently outperforms PyTorch baselines.

📝 Abstract
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve takes kernel specifications as input and automates kernel generation and optimization for recommendation models across heterogeneous hardware architectures. It does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search defined by a selection policy, universal operators, a fitness function, and a termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI accelerators.
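The abstract frames optimization as a graph-based search with a selection policy, universal operators, a fitness function, and a termination rule. The following is a minimal, hypothetical sketch of that loop shape; the names (`fitness`, `mutate`, `graph_search`) and the toy string mutation standing in for LLM-driven kernel rewriting are illustrative assumptions, not KernelEvolve's actual API or fitness metric.

```python
import random

def fitness(candidate: str) -> float:
    """Toy fitness: reward shorter 'kernels' (a stand-in for measured latency)."""
    return 1.0 / (1 + len(candidate))

def mutate(candidate: str, rng: random.Random) -> str:
    """Toy universal operator: drop one random character (in the real system,
    an LLM would rewrite the kernel at some abstraction level)."""
    if len(candidate) <= 1:
        return candidate
    i = rng.randrange(len(candidate))
    return candidate[:i] + candidate[i + 1:]

def graph_search(seed: str, budget: int = 50, rng_seed: int = 0) -> str:
    rng = random.Random(rng_seed)
    # The search graph: every candidate ever generated, with its score.
    graph = {seed: fitness(seed)}
    for _ in range(budget):  # termination rule: fixed evaluation budget
        # Selection policy: greedily expand the current best node.
        parent = max(graph, key=graph.get)
        child = mutate(parent, rng)
        graph.setdefault(child, fitness(child))
    return max(graph, key=graph.get)

best = graph_search("a_long_unoptimized_kernel_body")
```

In the paper's setting the fitness function would combine correctness checks against a reference implementation with measured runtime on the target accelerator, and the operators would be LLM edits guided by retrieval-augmented prompts.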
Problem

Research questions and friction points this paper is trying to address.

Automates kernel generation for diverse DLRM models on heterogeneous hardware
Reduces development time from weeks to hours by optimizing kernel coding
Mitigates programmability barriers for new AI accelerators through automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated kernel generation for heterogeneous AI accelerators
Graph-based search with retrieval-augmented prompt synthesis
Multi-abstraction programming from DSL to hardware-agnostic languages
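The retrieval-augmented prompt synthesis mentioned above can be sketched as follows. This is a hypothetical illustration: the corpus contents, the token-overlap scoring (a stand-in for embedding-based retrieval), and the function names are assumptions, not the paper's implementation.

```python
# A tiny corpus of prior kernel examples, keyed by a short description.
CORPUS = {
    "softmax reduction kernel": "triton softmax example ...",
    "fused embedding lookup": "triton embedding bag example ...",
    "grouped gemm": "cute grouped gemm example ...",
}

def overlap(a: str, b: str) -> int:
    """Crude relevance score: count of shared whitespace-separated tokens."""
    return len(set(a.split()) & set(b.split()))

def synthesize_prompt(spec: str, k: int = 2) -> str:
    """Rank corpus entries by relevance to the spec and splice the top-k
    examples into the code-generation prompt."""
    ranked = sorted(CORPUS, key=lambda name: overlap(spec, name), reverse=True)
    examples = "\n\n".join(CORPUS[name] for name in ranked[:k])
    return f"Task: implement a kernel for: {spec}\n\nRelevant examples:\n{examples}"

prompt = synthesize_prompt("fused softmax reduction over embedding rows")
```

The point of the sketch is the dynamic part: the prompt is rebuilt per kernel specification (and, in the paper's framing, per runtime execution context), rather than being a fixed template.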
Authors
Gang Liao
KernelEvolve Team, Meta Platforms
Hongsen Qin
KernelEvolve Team, Meta Platforms
Ying Wang
KernelEvolve Team, Meta Platforms
Alicia Golden
KernelEvolve Team, Meta Platforms
Michael Kuchnik
Meta
computer systems, machine learning
Yavuz Yetim
KernelEvolve Team, Meta Platforms
Jia Jiunn Ang
KernelEvolve Team, Meta Platforms
Chunli Fu
KernelEvolve Team, Meta Platforms
Yihan He
PhD Candidate, Research Engineer, ECE Dept, National University of Singapore
Probabilistic Computing, Computational Models, Computational Nanoelectronics
Samuel Hsia
KernelEvolve Team, Meta Platforms
Zewei Jiang
KernelEvolve Team, Meta Platforms
Dianshi Li
KernelEvolve Team, Meta Platforms
Uladzimir Pashkevich
KernelEvolve Team, Meta Platforms
Varna Puvvada
KernelEvolve Team, Meta Platforms
Feng Shi
KernelEvolve Team, Meta Platforms
Matt Steiner
KernelEvolve Team, Meta Platforms
Ruichao Xiao
KernelEvolve Team, Meta Platforms
Nathan Yan
KernelEvolve Team, Meta Platforms
Xiayu Yu
KernelEvolve Team, Meta Platforms
Zhou Fang
KernelEvolve Team, Meta Platforms
Abdul Zainul-Abedin
KernelEvolve Team, Meta Platforms
Ketan Singh
University of Southern California • Meta • Google • Apple
Machine Learning, Artificial Intelligence, Distributed Systems, Information Retrieval, Search Ranking
Hongtao Yu
KernelEvolve Team, Meta Platforms
Wenyuan Chi
KernelEvolve Team, Meta Platforms
Barney Huang
KernelEvolve Team, Meta Platforms