🤖 AI Summary
This study investigates whether large language models (LLMs) rely primarily on pre-trained priors or on environmental feedback when performing hardware-aware code optimization. Through three controlled experiments (black-box optimization, zero-shot kernel generation, and iterative feedback-driven refinement), the authors evaluate LLM behavior on kernels written in both CUDA and TVM IR. The work finds that LLMs predominantly draw on pre-trained knowledge rather than external feedback to guide optimization, and that they act as greedy optimizers in pure black-box optimization. Notably, providing input-size information yields no measurable benefit, and performance degrades under iterative feedback when the model works in TVM IR, a low-density language, whereas it improves monotonically with CUDA. These findings indicate that pre-trained priors dominate the optimization process and that low-density languages substantially impair LLM effectiveness.
📝 Abstract
LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses by conditioning the model's context on feedback received from an environment. However, as modern LLM agents grow structurally more complex, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We address these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect: models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when models are tasked with optimizing kernels for uncommon kernel sizes, performance degrades sharply regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, indicating that kernel optimization suffers when models operate in a low-density language. Our results indicate that LLMs in code optimization tasks depend heavily on pretrained priors rather than on provided feedback or agentic structure.
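To make the propose-evaluate-revise loop concrete, the following is a minimal Python sketch of such a loop over a one-dimensional black-box objective. It is not the authors' setup: `greedy_proposer` is a hypothetical stand-in for an LLM that merely perturbs the best candidate seen so far, which is one simple reading of "greedy" behavior, and the names `optimize`, `History`, the quadratic objective, and the step size are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple

# (candidate, score) pairs; in an LLM system this would be the feedback
# placed in the model's context. Lower scores are better here.
History = List[Tuple[float, float]]


def greedy_proposer(history: History, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM: perturb the best candidate so far."""
    if not history:
        return rng.uniform(-10.0, 10.0)  # no feedback yet: sample from a "prior"
    best_x, _ = min(history, key=lambda pair: pair[1])
    return best_x + rng.gauss(0.0, 0.5)  # small local step around the incumbent


def optimize(
    objective: Callable[[float], float],
    proposer: Callable[[History, random.Random], float],
    steps: int,
    seed: int = 0,
) -> History:
    """Run a propose-evaluate-revise loop and return the full history."""
    rng = random.Random(seed)
    history: History = []
    for _ in range(steps):
        candidate = proposer(history, rng)   # propose
        score = objective(candidate)         # evaluate in the environment
        history.append((candidate, score))   # revise: feedback joins the context
    return history


if __name__ == "__main__":
    # Illustrative objective with its minimum at x = 3.
    history = optimize(lambda x: (x - 3.0) ** 2, greedy_proposer, steps=200)
    best_x, best_score = min(history, key=lambda pair: pair[1])
    print(f"best x = {best_x:.3f}, score = {best_score:.4f}")
```

Swapping `greedy_proposer` for a proposer that ignores `history` gives a feedback-free baseline, which is the kind of comparison the controlled experiments above rely on to separate the effect of priors from the effect of feedback.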