Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the cold-start problem that large language models face when synthesizing high-performance kernels for data-scarce, specialized architectures such as NPUs. The authors propose EvoKernel, a framework that formulates kernel synthesis as a memory-augmented reinforcement learning task. EvoKernel introduces a phase-aware Q-value estimation mechanism and a value-driven memory retrieval strategy, enabling cross-operator experience transfer and allowing a general-purpose model to generate efficient kernels for niche hardware without fine-tuning. Experiments show that EvoKernel improves code correctness on NPUs from 11.0% to 83.0% and achieves a median latency speedup of 3.60× over initial drafts through iterative refinement.

📝 Abstract
Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a "Data Wall" limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models' correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.
Problem

Research questions and friction points this paper is trying to address.

cold-start
kernel synthesis
data-scarce
NPU
Domain-Specific Architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

value-driven memory
cold-start kernel synthesis
continual refinement
cross-task generalization
NPU programming