AI Summary
This work addresses the tendency of large language models (LLMs) in automated AI research to generate plausible yet ineffective ideas in the absence of execution-based validation and feedback. We propose the first execution-oriented closed-loop system for automated AI research, which executes LLM-generated research proposals concurrently in real GPU environments and optimizes strategies using evolutionary search and reinforcement learning guided by execution feedback. Experiments demonstrate that execution-guided evolutionary search discovers post-training methods that significantly outperform baselines (69.4% vs. 48.0%) and faster pretraining pipelines (19.7 vs. 35.9 minutes) within ten rounds. In contrast, reinforcement learning suffers from mode collapse, improving only the average reward without raising the performance ceiling. This study validates the critical role of execution feedback in improving the validity of research ideas and reveals the distinct efficacy of different learning paradigms in this setting.
Abstract
Automated AI research holds great potential to accelerate scientific discovery. However, current LLMs often generate plausible-looking but ineffective ideas. Execution grounding may help, but it is unclear whether automated execution is feasible and whether LLMs can learn from execution feedback. To investigate these questions, we first build an automated executor that implements ideas and launches large-scale parallel GPU experiments to verify their effectiveness. We then convert two realistic research problems, LLM pre-training and post-training, into execution environments and demonstrate that our automated executor can implement a large fraction of the ideas sampled from frontier LLMs. We analyze two methods for learning from execution feedback: evolutionary search and reinforcement learning. Execution-guided evolutionary search is sample-efficient: it finds a method that significantly outperforms the GRPO baseline on post-training (69.4% vs. 48.0%) and a pre-training recipe that outperforms the nanoGPT baseline (19.7 vs. 35.9 minutes), all within just ten search epochs. Frontier LLMs often generate meaningful algorithmic ideas during search, but they tend to saturate early and only occasionally exhibit scaling trends. Reinforcement learning from execution reward, on the other hand, suffers from mode collapse: it successfully improves the average reward of the ideator model but not the upper bound, because the model converges on simple ideas. We thoroughly analyze the executed ideas and training dynamics to facilitate future efforts toward execution-grounded automated AI research.
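The execution-guided evolutionary search described above can be sketched as a simple loop: an LLM ideator proposes candidate ideas, an executor runs each one as a real experiment to obtain a scalar metric, and the top-scoring candidates seed the next round. The sketch below is a minimal illustration of that loop, not the paper's implementation; `propose_ideas` and `execute` are hypothetical stand-ins (here an idea is a toy numeric vector and the executor scores it against a synthetic objective, whereas the real system would generate and run training code on GPUs).

```python
import random

def propose_ideas(parents, n):
    # Stand-in for the LLM ideator: mutate surviving parent ideas into n new
    # candidates. In the real system, an "idea" would be code or a config.
    if not parents:
        return [[random.random() for _ in range(4)] for _ in range(n)]
    return [[g + random.gauss(0, 0.1) for g in random.choice(parents)]
            for _ in range(n)]

def execute(idea):
    # Stand-in for the automated executor: run the idea as an experiment and
    # return a scalar metric. Here, a toy objective peaking at all-0.5 genes.
    return -sum((g - 0.5) ** 2 for g in idea)

def evolutionary_search(rounds=10, pop=8, keep=3):
    # Each round: propose, execute in parallel (sequential here), then keep
    # the top performers -- execution feedback drives the selection.
    parents, best = [], float("-inf")
    for _ in range(rounds):
        candidates = propose_ideas(parents, pop)
        scored = sorted(candidates, key=execute, reverse=True)
        parents = scored[:keep]
        best = max(best, execute(parents[0]))
    return best
```

The key design point, per the abstract, is that selection pressure comes from real execution results rather than from the LLM's own judgment of plausibility.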