🤖 AI Summary
Existing automatic kernel generation systems rely on coarse-grained feedback—such as functional correctness or end-to-end execution time—and lack fine-grained reasoning about hardware-level performance bottlenecks, which hinders efficient kernel optimization. This paper introduces the first performance-analysis-driven multi-agent large language model (LLM) framework, which integrates runtime hardware profiling signals—including L1 cache misses and instruction throughput—directly into the LLM's iterative reasoning loop, combining execution feedback with a historical best-version retention mechanism for progressive code refinement. Its key innovations are: (i) the first closed-loop incorporation of fine-grained hardware performance insights into an LLM-based kernel generation pipeline, and (ii) unified support for both CPU and GPU backends. Evaluated on KernelBench, the approach substantially outperforms a no-profiling baseline, achieving average speedups over Torch of 2.81× on CPU and 2.30× on GPU.
📝 Abstract
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution-time feedback and lack the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms the baseline AIKG without profiling enabled and achieves 2.81× and 2.30× average speedups against Torch on CPU and GPU platforms, respectively.
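The closed loop described above—generate a candidate, profile it, feed the hardware signals back to the model, and retain the historical best version—can be sketched as follows. This is a minimal illustrative skeleton, not PRAGMA's actual implementation: `generate_kernel` and `profile_kernel` are hypothetical stand-ins for the LLM call and the hardware profiler, with simulated runtimes and metrics.

```python
import random

def generate_kernel(feedback_history):
    """Hypothetical stand-in for the LLM generation call.

    Simulates a candidate whose runtime tends to improve as more
    profiling feedback accumulates in the prompt context.
    """
    base_runtime = 10.0
    improvement = 2.0 * len(feedback_history)  # richer feedback -> better code
    runtime = max(1.0, base_runtime - improvement + random.uniform(-0.5, 0.5))
    return {"code": f"kernel_v{len(feedback_history)}", "runtime": runtime}

def profile_kernel(kernel):
    """Hypothetical stand-in for hardware profiling (e.g. perf/Nsight).

    Returns fine-grained signals of the kind PRAGMA feeds back to the LLM,
    such as cache-miss rates and instruction throughput.
    """
    return {
        "l1_miss_rate": kernel["runtime"] * 0.01,
        "instr_throughput": 1.0 / kernel["runtime"],
    }

def refine(iterations=5, seed=0):
    """Profile-guided iterative refinement with best-version retention."""
    random.seed(seed)
    feedback_history, best = [], None
    for _ in range(iterations):
        candidate = generate_kernel(feedback_history)
        metrics = profile_kernel(candidate)
        feedback_history.append(metrics)  # close the loop with profiling signals
        if best is None or candidate["runtime"] < best["runtime"]:
            best = candidate              # keep the historical best version
    return best

if __name__ == "__main__":
    best = refine()
    print(best["code"], round(best["runtime"], 2))
```

The two pieces the paper emphasizes are both visible here: the profiling metrics re-enter the generation step on every iteration (the closed loop), and a regression in any single iteration never discards the best kernel found so far (best-version retention).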