🤖 AI Summary
This work addresses two obstacles that arise when large language models (LLMs) automatically generate GPU kernels: an excessively large search space and difficulty identifying bottlenecks. Both stem from a mismatch between the granularity of optimization knowledge and LLM reasoning. To bridge this gap, we propose HTAM (Hierarchical Transition-Attended Memory), a framework that organizes optimization knowledge into a two-level Hierarchical Transition Graph (HTG) encompassing coarse-grained global directions and fine-grained local strategies. HTAM guides LLMs in generating efficient CUDA code through a state-aware mechanism that first selects a global optimization direction and then retrieves the corresponding local strategies. Evaluated on the full KernelBench suite, HTAM consistently improves correctness rate, the proportion of fast solutions, and speedup over LLM-based baselines, and it generalizes across backend configurations and to the Robust-KBench benchmark.
📝 Abstract
High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.
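The coarse-to-fine step described in the abstract (select a global direction, then retrieve its local strategy memory) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name, method names, direction labels, and the smoothed success-rate score are all hypothetical. The selection here conditions only on the previous direction, whereas the abstract describes conditioning on the current state and recent optimization history.

```python
from collections import defaultdict


class HierarchicalTransitionGraph:
    """Toy two-level memory: coarse directions map to fine-grained strategies.

    Hypothetical illustration only; names and scoring are assumptions.
    """

    def __init__(self):
        # Level 1 -> level 2: direction name -> list of concrete strategy hints.
        self.strategies = defaultdict(list)
        # Transition experience: (previous direction, next direction) -> [successes, attempts].
        self.transitions = defaultdict(lambda: [0, 0])

    def add_strategy(self, direction, hint):
        self.strategies[direction].append(hint)

    def record_transition(self, prev, nxt, improved):
        # Store whether moving from `prev` to `nxt` improved the kernel.
        stats = self.transitions[(prev, nxt)]
        stats[0] += int(improved)
        stats[1] += 1

    def select_direction(self, prev):
        # Score each direction by a Laplace-smoothed success rate of the
        # transition prev -> direction (assumed heuristic, not from the paper).
        def score(direction):
            wins, tries = self.transitions.get((prev, direction), (0, 0))
            return (wins + 1) / (tries + 2)

        # Sorting makes tie-breaking deterministic (alphabetical order).
        return max(sorted(self.strategies), key=score)

    def retrieve(self, direction, k=2):
        # Return up to k local strategies stored under the chosen direction.
        return self.strategies[direction][:k]


def build_demo_graph():
    htg = HierarchicalTransitionGraph()
    htg.add_strategy("memory_access", "coalesce global loads")
    htg.add_strategy("memory_access", "stage tiles in shared memory")
    htg.add_strategy("parallelism", "tune block size")
    htg.record_transition("start", "parallelism", True)
    htg.record_transition("start", "memory_access", False)
    return htg
```

In a full system, the retrieved hints would be placed in the LLM prompt that generates the next CUDA candidate, and the measured outcome would be written back with `record_transition`.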