QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
GPU kernel development relies heavily on expert manual coding, suffers from poor portability, and existing LLM-based generation methods fundamentally trade off correctness against efficiency. Method: This paper proposes the Macro Thinking Micro Coding (MTMC) framework, a hierarchical "macro-strategy reasoning–micro-code generation" co-design paradigm: a top-level reinforcement learning agent searches over semantically equivalent optimization strategies, while a lightweight, general-purpose LLM incrementally implements each proposed step as verifiable kernel code, decoupling strategy design from low-level implementation. Results: On KernelBench, MTMC achieves 99.8% functional correctness, exceeding state-of-the-art models by over 50 percentage points, and delivers up to 7.3× speedup. On TritonBench, it attains 59.64% correctness and up to 34× speedup over expert-written kernels, markedly easing the correctness–efficiency bottleneck in LLM-driven kernel synthesis.

📝 Abstract
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental, conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation code. To address this intractable search space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC in both accuracy and running time. On KernelBench, MTMC achieves near-100% and 70% accuracy at Levels 1-2 and 3 respectively, over 50 percentage points above SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLM-generated kernels and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
Problem

Research questions and friction points this paper is trying to address.

Automating high-performance GPU kernel generation for AI
Resolving correctness-efficiency conflict in LLM-based approaches
Navigating vast optimization space through hierarchical decoupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework decouples optimization from implementation
Reinforcement learning guides lightweight LLMs for strategy exploration
General-purpose LLMs incrementally implement stepwise optimization proposals
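The decoupled loop described above can be sketched as follows. This is a hypothetical illustration of the paradigm, not the paper's actual API: `propose_strategy`, `apply_strategy`, and `is_correct` are illustrative stand-ins for the RL-guided Macro Thinking policy, the Micro Coding LLM, and functional verification, respectively.

```python
# Illustrative sketch of a macro-thinking / micro-coding loop (all names are
# hypothetical stand-ins, not the paper's implementation): a high-level policy
# proposes one semantic optimization step at a time, a coder applies it as an
# incremental edit, and only verified improvements are kept.

from dataclasses import dataclass


@dataclass
class Kernel:
    source: str
    runtime_ms: float


def propose_strategy(kernel: Kernel, step: int) -> str:
    """Stand-in for the RL-guided 'Macro Thinking' policy: returns a
    semantic optimization proposal rather than low-level code."""
    strategies = ["fuse elementwise ops", "tile for shared memory", "vectorize loads"]
    return strategies[step % len(strategies)]


def apply_strategy(kernel: Kernel, strategy: str) -> Kernel:
    """Stand-in for 'Micro Coding': edits only the code touched by the
    proposal instead of regenerating the whole kernel (toy effect here)."""
    return Kernel(kernel.source + f"\n# applied: {strategy}", kernel.runtime_ms * 0.9)


def is_correct(kernel: Kernel) -> bool:
    """Stand-in for functional verification against a reference output."""
    return True


def mtmc_optimize(seed: Kernel, steps: int = 3) -> Kernel:
    best = seed
    for step in range(steps):
        proposal = propose_strategy(best, step)     # macro: choose a strategy
        candidate = apply_strategy(best, proposal)  # micro: incremental edit
        if is_correct(candidate) and candidate.runtime_ms < best.runtime_ms:
            best = candidate                        # keep only verified gains
    return best
```

Because each step is a small, checkable edit rather than a full-kernel regeneration, a failed proposal leaves the last verified kernel intact, which is how the paradigm avoids trading correctness for efficiency.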
Xinguo Zhu
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Shaohui Peng
Institute of Software, Chinese Academy of Sciences
Embodied AI · Reinforcement Learning
Jiaming Guo
Institute of Computing Technology, Chinese Academy of Sciences
Artificial Intelligence · Reinforcement Learning
Yunji Chen
Institute of Computing Technology, Chinese Academy of Sciences
Processor Architecture · Microarchitecture · Machine Learning
Qi Guo
State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China
Yuanbo Wen
Institute of Computing Technology, Chinese Academy of Sciences
Machine Learning System
Hang Qin
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Ruizhi Chen
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Qirui Zhou
State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China
Ke Gao
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Yanjun Wu
Institute of Software, Chinese Academy of Sciences
Computer Science
Chen Zhao
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China
Ling Li
Intelligent Software Research Center, Institute of Software, CAS, Beijing, China