STARK: Strategic Team of Agents for Refining Kernels

📅 2025-10-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
GPU kernel optimization is highly complex due to the tight coupling among memory hierarchies, thread scheduling, and hardware-specific constraints; existing LLM-based approaches, which rely on single-shot generation or simplistic iteration, struggle with multi-objective, co-dependent tuning in realistic scenarios. This paper proposes an LLM-powered multi-agent collaborative optimization framework that emulates expert engineers' diagnostic-analytic-refactoring closed loop. It integrates strategic search, dynamic context management, instruction-guided refinement, and hardware-aware reasoning, and establishes a performance-profiling feedback loop. The framework enables fine-grained task decomposition and context-adaptive evolution, substantially overcoming the limitations of conventional LLM-driven optimization paradigms. Evaluated on KernelBench, it achieves a marked increase in the rate of correct solution generation, and the optimized kernels deliver up to 16× speedup over baseline implementations.
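The profiling feedback loop described above can be sketched in miniature. This is an illustrative assumption of how a diagnose-refine-profile cycle might be wired together; the names (`Candidate`, `profile_kernel`, `refine`, `optimize`) and the toy runtime model are hypothetical, not the paper's actual API.

```python
# Hypothetical sketch of a diagnostic-refactoring loop with profiling feedback.
# All names and the toy runtime model are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str
    runtime_ms: float


def profile_kernel(source: str) -> float:
    """Stand-in profiler: a real system would compile and time the kernel."""
    # Toy model: each applied optimization pass shaves 10 ms off the runtime.
    return max(1.0, 100.0 - 10.0 * source.count("opt"))


def refine(source: str, diagnosis: str) -> str:
    """Stand-in for an LLM refinement step guided by the profiling diagnosis."""
    return source + " opt"  # pretend one optimization was applied


def optimize(baseline: str, max_iters: int = 5) -> Candidate:
    best = Candidate(baseline, profile_kernel(baseline))
    for _ in range(max_iters):
        diagnosis = f"runtime {best.runtime_ms:.1f} ms; check memory coalescing"
        candidate_src = refine(best.source, diagnosis)
        runtime = profile_kernel(candidate_src)
        if runtime < best.runtime_ms:  # keep only verified improvements
            best = Candidate(candidate_src, runtime)
    return best


best = optimize("kernel_v0")
print(best.runtime_ms)  # prints 50.0 under the toy model
```

The key design point the summary emphasizes is that candidates are accepted only when measured profiling confirms an improvement, rather than trusting the model's unverified output.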

📝 Abstract
The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as single-shot generators or naive refinement tools, limiting their effectiveness in navigating the irregular kernel optimization landscape. We introduce an LLM agentic framework for GPU kernel optimization that systematically explores the design space through multi-agent collaboration, grounded instruction, dynamic context management, and strategic search. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents: our system produces correct solutions where baselines often fail, and achieves kernels with up to 16x faster runtime performance. These results highlight the potential of agentic LLM frameworks to advance fully automated, scalable GPU kernel optimization.
Problem

Research questions and friction points this paper is trying to address.

Optimizing GPU kernels for AI efficiency
Navigating complex hardware-software interactions
Automating iterative kernel refinement with LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent collaboration explores kernel design space
Dynamic context management enables iterative kernel refinement
Strategic search incorporates profiling feedback for optimization
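The three innovation bullets above suggest a division of labor among specialized agents. The sketch below is a hypothetical orchestration of one collaboration round; the role names (`diagnoser`, `strategist`, `coder`) and the context-trimming policy are assumptions made for illustration, not the paper's actual design.

```python
# Illustrative multi-agent round: diagnose, plan, rewrite, then trim context.
# Role names and message formats are assumptions, not the paper's design.
def diagnoser(kernel: str, context: list) -> str:
    return "bottleneck: global memory traffic"  # stand-in for LLM analysis


def strategist(diagnosis: str, context: list) -> str:
    return "apply shared-memory tiling"  # stand-in for strategic plan selection


def coder(kernel: str, instruction: str) -> str:
    return kernel + f" [{instruction}]"  # stand-in for instruction-guided rewrite


def run_round(kernel: str, context: list) -> str:
    diagnosis = diagnoser(kernel, context)
    plan = strategist(diagnosis, context)
    new_kernel = coder(kernel, plan)
    # Dynamic context management: retain only the most recent findings
    # so the prompt stays within budget across many rounds.
    context.append((diagnosis, plan))
    del context[:-4]
    return new_kernel


ctx: list = []
kernel = run_round("kernel_v0", ctx)
print(kernel)  # prints kernel_v0 [apply shared-memory tiling]
```

Separating diagnosis from planning and code rewriting mirrors the expert workflow the abstract describes, and the pruned shared context is one plausible reading of "dynamic context management."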
👥 Authors

Juncheng Dong
Meta Ranking AI Research

Yang Yang
Meta Ranking AI Research

Tao Liu
Meta Ranking AI Research

Yang Wang
Meta Ranking AI Research

Feng Qi
Retired researcher
Special Functions · Analytic Combinatorics · Analytic Number Theory · Mathematical Inequalities

Vahid Tarokh
Duke University
Foundations of AI

Kaushik Rangadurai
Researcher at Meta
Machine Learning · Artificial Intelligence · Search

Shuang Yang
Meta Ranking AI Research