cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated CUDA kernel optimization faces three obstacles: the difficulty of tight hardware-software co-design, inefficient surrogate modeling, and semantic misalignment between evolutionary representations and hardware constraints. To address them, this paper proposes cuPilot, a strategy-coordinated multi-agent framework. It introduces "strategy" as a semantic abstraction for modeling kernel evolution, giving a strategy-level evolutionary representation and a cross-agent coordination mechanism. Roofline-guided LLM prompting combined with strategy-aware population initialization enables hardware-informed, end-to-end kernel generation. By unifying large language models, evolutionary algorithms, Roofline performance modeling, and multi-agent scheduling, the approach achieves an average 3.09× speedup over PyTorch on a benchmark of 100 kernels; on GEMM workloads it achieves high utilization of critical hardware units (e.g., tensor cores, shared memory). All generated kernels are open-sourced.

📝 Abstract
Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that kernels generated by cuPilot achieve an average speedup of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.
Problem

Research questions and friction points this paper is trying to address.

Optimizes CUDA kernels automatically using a multi-agent framework
Addresses performance gaps in existing LLM-based kernel evolution methods
Enhances hardware utilization through a strategy-coordinated evolution algorithm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Strategy-coordinated multi-agent framework for kernel evolution
Roofline-guided prompting to optimize CUDA kernel performance
Strategy-level population initialization for the evolutionary algorithm
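
The summary does not spell out how roofline analysis feeds into prompting, so the following is only a minimal sketch of the general idea, not cuPilot's implementation. It applies the classic roofline model (attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity) to a GEMM and turns the result into a one-line hint that could be placed in an LLM prompt. The `Device` class, the function names, the hint wording, and the device numbers in the usage note are all hypothetical, and the intensity estimate assumes each matrix crosses DRAM exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description (not taken from the paper)."""
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth, bytes/s


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of C = A @ B.

    Assumption: A, B, and C each move through DRAM exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(device: Device, intensity: float) -> float:
    """Attainable FLOP/s under the classic roofline model."""
    return min(device.peak_flops, device.peak_bandwidth * intensity)


def roofline_hint(device: Device, intensity: float) -> str:
    """Build an illustrative prompt hint from the roofline regime."""
    # The ridge point is the intensity at which the two roofs meet.
    ridge = device.peak_flops / device.peak_bandwidth
    if intensity < ridge:
        regime = "memory-bound"
        advice = "favor strategies that cut DRAM traffic (tiling, shared-memory reuse, fusion)"
    else:
        regime = "compute-bound"
        advice = "favor strategies that raise math throughput (tensor cores, more work per thread)"
    return (
        f"Kernel is {regime} (intensity {intensity:.1f} FLOP/B, "
        f"ridge {ridge:.1f} FLOP/B); {advice}."
    )
```

For example, with an illustrative device of `Device(peak_flops=100e12, peak_bandwidth=1e12)`, the ridge point is 100 FLOP/B. A 64×64×64 half-precision GEMM has an intensity of about 21 FLOP/B and is classified as memory-bound, while a 4096×4096×4096 GEMM (about 1365 FLOP/B) is classified as compute-bound.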
Jinwu Chen
Southeast University, Nanjing, China
Qidie Wu
Tsinghua University, Beijing, China
Bin Li
Tsing Micro, Beijing, China
Lin Ma
Tsing Micro, Beijing, China
Xin Si
Southeast University
Memory, Computation in memory, AI processor, Analog/mixed signal circuit
Yang Hu
Tsinghua University, Beijing, China
Shouyi Yin
Tsinghua University
Jun Yang
National Center of Technology Innovation for EDA, Nanjing, China