🤖 AI Summary
Kernel optimization for emerging AI accelerators relies heavily on expert hardware knowledge, hindering automation and scalability. Method: This paper proposes AccelOpt, a self-improving optimization framework powered by large language model (LLM) agents. It maintains an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs, and uses iterative generation-feedback loops to achieve end-to-end automatic tuning without requiring domain-specific hardware expertise. The framework is evaluated on the NKIBench benchmark. Results: On kernels drawn from real LLM workloads, it raises the average percentage of peak throughput to 61% (+12 percentage points) on Trainium 1 and 59% (+14 percentage points) on Trainium 2, substantially outperforming the unoptimized baselines. Using open-source LLMs, it matches the kernel improvements of Claude Sonnet 4 at 1/26 the cost. Contribution: This work combines LLM agents, self-improving memory, and accelerator kernel optimization, establishing a new paradigm for autonomous AI system optimization.
📝 Abstract
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
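The generate-and-profile loop with a curated memory described above can be sketched in a few lines. This is a hypothetical illustration of the idea, not the paper's implementation: the names `propose_kernel` and `measure_throughput` are illustrative stand-ins for the LLM-driven generation step and the on-device profiler.

```python
# Minimal sketch of an iterative kernel-optimization loop with an
# optimization memory of slow-fast kernel pairs (hypothetical names;
# not AccelOpt's actual API).

def optimize(kernel, steps, propose_kernel, measure_throughput):
    """Iteratively refine `kernel`, curating slow-fast pairs as memory."""
    memory = []  # curated (slow, fast) pairs and their observed speedups
    best, best_tp = kernel, measure_throughput(kernel)
    for _ in range(steps):
        # Generation is informed by past experiences (the optimization memory).
        candidate = propose_kernel(best, memory)
        tp = measure_throughput(candidate)
        if tp > best_tp:
            # Record the slow-fast pair so future proposals can reuse the insight.
            memory.append({"slow": best, "fast": candidate,
                           "speedup": tp / best_tp})
            best, best_tp = candidate, tp
    return best, best_tp, memory
```

In the real system the memory would hold natural-language insights extracted from the pairs, and the proposal step would prompt an LLM agent with those insights; the skeleton above only shows the feedback structure.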