🤖 AI Summary
Existing approaches struggle to explain how self-evolving large language models attribute and combine heterogeneous feedback signals when making planning decisions during CUDA kernel generation, and conventional end-to-end ablations cannot separate feedback effects from trajectory-dependent drift. To address this, this work proposes CUDAnalyst, an analysis framework for controlled, generation-level feedback attribution. By freezing execution trajectories and selectively injecting feedback, CUDAnalyst reveals structured relationships between feedback components and high-level planning. The framework supports modeling of multi-feedback interactions and cross-model plan transfer, and shows that high-level plans from stronger reasoning models can be partially transferred to weaker ones, which suggests that the feedback-to-plan structure is not specific to a single model. Experiments show that explicit planning is beneficial only when feedback is aligned, a finding that holds across the reference backbones, representative workloads, and reference induction regimes studied.
📝 Abstract
Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift.
We introduce \texttt{CUDAnalyst}, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. \texttt{CUDAnalyst} enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.
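To make "coalitional-style attribution" concrete, the following is a minimal sketch of one standard way such an attribution could be computed, namely exact Shapley values over subsets of feedback components. It is an illustration under stated assumptions, not the authors' implementation: the component names in `FEEDBACK`, the function names, and the scores returned by `toy_value` are all hypothetical. In a setting like the one described, `value(subset)` would stand for the generation-level score obtained when only that subset of feedback is injected into a frozen trajectory.

```python
from itertools import combinations
from math import factorial

# Hypothetical feedback components; the names are illustrative only.
FEEDBACK = ("compile_log", "correctness", "profiler")


def toy_value(subset):
    """Stand-in for a generation-level score measured on a frozen trajectory
    when only `subset` of the feedback is injected. The numbers are made up."""
    score = 0.0
    if "correctness" in subset:
        score += 0.2
    if "profiler" in subset:
        score += 0.1
    if "correctness" in subset and "profiler" in subset:
        score += 0.3  # synergy: the two signals help more together
    return score


def shapley_values(components, value):
    """Exact Shapley value of each component under the value function."""
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for k in range(n):
            # Weight of a coalition of size k that does not contain c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[c] += weight * (value(s | {c}) - value(s))
    return phi


def pairwise_interaction(a, b, value):
    """Second-order difference: the joint effect of a and b beyond the sum
    of their individual effects, measured from the empty coalition."""
    empty = frozenset()
    return (value(frozenset({a, b})) - value(frozenset({a}))
            - value(frozenset({b})) + value(empty))


phi = shapley_values(FEEDBACK, toy_value)
synergy = pairwise_interaction("correctness", "profiler", toy_value)
```

Exact enumeration evaluates every subset, so it is practical only for a handful of feedback components; with a larger set, a sampling-based approximation would be the usual substitute. Evaluating each subset against the same frozen trajectory is what lets differences in `value` be read as effects of the injected feedback rather than of trajectory drift.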