🤖 AI Summary
Existing automatic kernel generation systems rely on coarse-grained feedback—such as functional correctness or end-to-end execution time—and lack fine-grained reasoning about hardware-level performance bottlenecks, which hinders efficient kernel optimization. This paper introduces the first performance-analysis-driven multi-agent large language model (LLM) framework, which integrates runtime hardware profiling signals—including L1 cache misses and instruction throughput—directly into the LLM's iterative reasoning loop, combining execution feedback with a historical best-version retention mechanism for progressive code refinement. Its key innovations are: (i) the first closed-loop incorporation of fine-grained hardware performance insights into an LLM-based kernel generation pipeline, and (ii) unified support for both CPU and GPU backends. Evaluated on KernelBench, the approach substantially outperforms a no-profiling baseline, achieving average speedups over Torch of 2.81× on CPU and 2.30× on GPU.
📝 Abstract
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution-time feedback and lack the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms the baseline AIKG without profiling enabled and achieves 2.81× and 2.30× average speedups against Torch on CPU and GPU platforms, respectively.
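The closed loop described above—generate a candidate, profile it, feed the hardware signals back to the model, and retain the historical best version—can be sketched as follows. This is a minimal illustrative skeleton, not PRAGMA's actual implementation: `generate_kernel` and `profile_kernel` are hypothetical stand-ins for the LLM call and the hardware profiler, with simulated runtimes and metrics.

```python
import random

def generate_kernel(feedback_history):
    """Hypothetical stand-in for the LLM generation call.

    Simulates a candidate whose runtime tends to improve as more
    profiling feedback accumulates in the prompt context.
    """
    base_runtime = 10.0
    improvement = 2.0 * len(feedback_history)  # richer feedback -> better code
    runtime = max(1.0, base_runtime - improvement + random.uniform(-0.5, 0.5))
    return {"code": f"kernel_v{len(feedback_history)}", "runtime": runtime}

def profile_kernel(kernel):
    """Hypothetical stand-in for hardware profiling (e.g. perf/Nsight).

    Returns fine-grained signals of the kind PRAGMA feeds back to the LLM,
    such as cache-miss rates and instruction throughput.
    """
    return {
        "l1_miss_rate": kernel["runtime"] * 0.01,
        "instr_throughput": 1.0 / kernel["runtime"],
    }

def refine(iterations=5, seed=0):
    """Profile-guided iterative refinement with best-version retention."""
    random.seed(seed)
    feedback_history, best = [], None
    for _ in range(iterations):
        candidate = generate_kernel(feedback_history)
        metrics = profile_kernel(candidate)
        feedback_history.append(metrics)  # close the loop with profiling signals
        if best is None or candidate["runtime"] < best["runtime"]:
            best = candidate              # keep the historical best version
    return best

if __name__ == "__main__":
    best = refine()
    print(best["code"], round(best["runtime"], 2))
```

The two pieces the paper emphasizes are both visible here: the profiling metrics re-enter the generation step on every iteration (the closed loop), and a regression in any single iteration never discards the best kernel found so far (best-version retention).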