PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing automatic kernel generation systems rely on coarse-grained feedback—such as functional correctness or end-to-end execution time—and lack fine-grained reasoning about hardware-level performance bottlenecks, hindering efficient kernel optimization. This paper introduces the first performance-analysis-driven multi-agent large language model (LLM) framework, which deeply integrates runtime hardware profiling signals—including L1 cache misses and instruction throughput—into the LLM’s iterative reasoning loop, combining execution feedback with a historical best-version retention mechanism for progressive code refinement. Its key innovations are: (i) the first closed-loop incorporation of fine-grained hardware performance insights into the LLM-based kernel generation pipeline, and (ii) unified support for both CPU and GPU backends. Evaluated on KernelBench, the approach outperforms a no-profiling baseline and achieves average speedups over Torch of 2.81× on CPU and 2.30× on GPU.

📝 Abstract
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms the baseline AIKG without profiling enabled and achieves 2.81× and 2.30× averaged speedups against Torch on CPU and GPU platforms, respectively.
Problem

Research questions and friction points this paper is trying to address.

Automating expert-level kernel optimization using LLMs and profiling feedback
Addressing performance bottleneck reasoning in AI-generated kernel codes
Integrating hardware profiling with iterative code refinement for speedups
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates hardware profiling into AI reasoning loop
Enables LLMs to identify performance bottlenecks automatically
Iteratively refines code using profile-guided optimization
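The loop described above can be sketched in miniature. This is a hypothetical illustration, not PRAGMA's implementation: `generate_kernel` stands in for the LLM agent, `profile` stands in for the hardware profiler, and the runtime model is invented. Only the control flow—feeding profiling hints back into generation while retaining the historical best version—reflects the described technique.

```python
def generate_kernel(hints):
    """Stand-in for the LLM generation agent (hypothetical).

    Each accumulated bottleneck hint is assumed to let the model shave
    some runtime off the candidate; the numbers are illustrative only.
    """
    runtime = max(10.0 - 0.5 * len(hints), 1.0)
    return {"code": f"kernel_v{len(hints)}", "runtime": runtime}


def profile(candidate):
    """Stand-in for hardware profiling: map a measurement to coarse hints."""
    hints = []
    if candidate["runtime"] > 5.0:
        hints.append("high L1 cache miss rate")
    if candidate["runtime"] > 2.0:
        hints.append("low instruction throughput")
    return hints


def optimize(iterations=8):
    """Profile-guided refinement with historical best-version retention."""
    hints, best = [], None
    for _ in range(iterations):
        candidate = generate_kernel(hints)
        # Retain the best version seen so far, never regressing.
        if best is None or candidate["runtime"] < best["runtime"]:
            best = candidate
        # Close the loop: profiling signals steer the next generation round.
        hints += profile(candidate)
    return best


if __name__ == "__main__":
    print(optimize())
```

In the real system, the profiling step would report counters such as L1 cache misses and instruction throughput, and the generation step would receive them as structured feedback in the LLM prompt.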
Kelun Lei
School of Computer Science and Engineering, Beihang University, Beijing, China
Hailong Yang
School of Computer Science and Engineering, Beihang University, Beijing, China
Huaitao Zhang
School of Computer Science and Engineering, Beihang University, Beijing, China
Xin You
Beihang University
Performance Tools, HPC
Kaige Zhang
School of Computer Science and Engineering, Beihang University, Beijing, China
Zhongzhi Luan
Beihang University
Yi Liu
School of Computer Science and Engineering, Beihang University, Beijing, China
Depei Qian
School of Computer Science and Engineering, Beihang University, Beijing, China