KernelFoundry: Hardware-aware evolutionary GPU kernel optimization

Date: 2026-03-12
Citations: 0
Influential: 0
AI Summary
This work addresses the difficulty large language models (LLMs) have in optimizing GPU kernels, a task that requires understanding hardware architecture, parallel optimization strategies, and profiling output. Existing LLM-based approaches mostly rely on simple prompting and feedback loops and gain hardware awareness only indirectly, through profiling feedback. The authors propose KernelFoundry, an evolutionary framework that combines MAP-Elites quality-diversity search, meta-prompt evolution that co-evolves prompts with kernels, and template-based parameter tuning. The framework generates both SYCL and CUDA kernels and is implemented as a distributed system with remote access to diverse hardware. On KernelBench, the generated SYCL kernels achieve an average speedup of 2.3×, and the approach consistently outperforms the baseline methods; a flexible user input layer extends kernel generation to real-world use cases beyond benchmarking.
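As a rough illustration of what template-based parameter tuning can look like, the sketch below renders a kernel template for every combination in a small search space and keeps the fastest variant. It is not the authors' implementation: the template text, the parameter names (tile, wg_size), the search space, and the toy_runtime_ms function are all assumptions made for this example, and the synthetic cost function stands in for compiling and timing each variant on a real device.

```python
import itertools
from string import Template

# Hypothetical kernel template; the placeholders are the tunable parameters.
KERNEL_TEMPLATE = Template(
    "constexpr int TILE = ${tile};\n"
    "constexpr int WG_SIZE = ${wg_size};\n"
    "// ... kernel body that uses TILE and WG_SIZE ...\n"
)

# Assumed search space, chosen only for illustration.
SEARCH_SPACE = {"tile": [8, 16, 32], "wg_size": [64, 128, 256]}


def render(params):
    """Fill the template placeholders with one concrete parameter set."""
    return KERNEL_TEMPLATE.substitute(params)


def toy_runtime_ms(params):
    """Synthetic stand-in for compiling and timing a kernel on a device."""
    tile_penalty = abs(params["tile"] - 16) / 16
    wg_penalty = abs(params["wg_size"] - 128) / 128
    return 1.0 + tile_penalty + wg_penalty


def tune(space, measure):
    """Exhaustively evaluate every combination and return the fastest one."""
    names = sorted(space)
    best = None
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        runtime = measure(params)
        if best is None or runtime < best[1]:
            best = (params, runtime)
    return best


best_params, best_time = tune(SEARCH_SPACE, toy_runtime_ms)
best_source = render(best_params)
```

In a real setting the measure callback would build and benchmark best_source on the target hardware, and an exhaustive grid would likely give way to a smarter search once the parameter space grows.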

Abstract
Optimizing GPU kernels presents a significantly greater challenge for large language models (LLMs) than standard code generation tasks, as it requires understanding hardware architecture, parallel optimization strategies, and performance profiling outputs. Most existing LLM-based approaches to kernel generation rely on simple prompting and feedback loops, incorporating hardware awareness only indirectly through profiling feedback. We introduce KernelFoundry, an evolutionary framework that efficiently explores the GPU kernel design space through three key mechanisms: (1) MAP-Elites quality-diversity search with kernel-specific behavioral dimensions to sustain exploration across diverse optimization strategies; (2) meta-prompt evolution, which co-evolves prompts with kernels to uncover task-specific optimization strategies; and (3) template-based parameter optimization to tune kernels to inputs and hardware. We evaluate this framework on KernelBench, robust-kbench, and custom tasks, generating SYCL kernels as a cross-platform GPU programming model and CUDA kernels for comparison to prior work. Our approach consistently outperforms the baseline methods, achieving an average speedup of 2.3x on KernelBench for SYCL. Moreover, KernelFoundry is implemented as a distributed framework with remote access to diverse hardware, enabling rapid benchmarking and featuring a flexible user input layer that supports kernel generation for a wide range of real-world use cases beyond benchmarking.
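To make the first mechanism concrete, here is a minimal MAP-Elites sketch on a toy problem. It is an illustration under stated assumptions, not KernelFoundry's code: a candidate is reduced to a hypothetical (tile_size, unroll, use_local_mem) tuple, the two behavioral dimensions (local-memory use and low/high unrolling) are invented for this example, mutation is random rather than LLM-driven, and toy_speedup is a synthetic score rather than a measured speedup.

```python
import random

# Hypothetical parameter choices for a toy "kernel" candidate.
TILES = [8, 16, 32, 64]
UNROLLS = [1, 2, 4, 8]


def toy_speedup(cand):
    """Synthetic stand-in for a measured speedup; not a real benchmark."""
    tile, unroll, local_mem = cand
    score = 1.0
    score += 0.4 * (1 - abs(tile - 32) / 56)
    score += 0.2 * (1 - abs(unroll - 4) / 7)
    return score + (0.3 if local_mem else 0.0)


def descriptor(cand):
    """Map a candidate to its archive cell (assumed behavioral dimensions)."""
    _, unroll, local_mem = cand
    return (int(local_mem), 0 if unroll <= 2 else 1)


def mutate(cand, rng):
    """Change one randomly chosen field of the candidate."""
    tile, unroll, local_mem = cand
    choice = rng.randrange(3)
    if choice == 0:
        tile = rng.choice(TILES)
    elif choice == 1:
        unroll = rng.choice(UNROLLS)
    else:
        local_mem = not local_mem
    return (tile, unroll, local_mem)


def map_elites(iterations=200, seed=0):
    """Keep the best candidate per cell; sample parents from the archive."""
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, candidate)

    def consider(cand):
        cell, fitness = descriptor(cand), toy_speedup(cand)
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, cand)

    consider((8, 1, False))  # seed the archive with a naive starting point
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


archive = map_elites()
```

The point of the archive is that a candidate only competes with others in the same cell, so a slower but structurally different strategy survives and can later seed a better one. A single best-so-far search would discard it.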
Problem

Research questions and friction points this paper is trying to address.

GPU kernel optimization
hardware-aware
evolutionary optimization
parallel computing
performance tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

evolutionary optimization
hardware-aware code generation
MAP-Elites
meta-prompt evolution
GPU kernel synthesis