🤖 AI Summary
This work proposes a fully automated, open-ended neural architecture discovery framework that removes the reliance on manual design of architectures and hyperparameters. The approach formalizes architecture search as a Markov decision process in which a reinforcement learning agent autonomously modifies training scripts, executes experiments, and explores the search space under a fixed environment and wall-clock time budget, using validation-set bits-per-byte (val-bpb) as the reward signal. By separating three concerns — the frozen environment, the mutable target files, and the meta-learner — and combining PPO optimization with a fixed data pipeline and an editable training-script mechanism, the framework enables continuous optimization without human intervention. On the single-GPU nanochat pretraining benchmark, it discovers configurations that match or surpass manually tuned baselines within approximately 300 iterations.
📝 Abstract
We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
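The outer loop described above can be sketched in miniature. Everything in this example is a hypothetical stand-in: `run_experiment` is a toy proxy for "execute train.py under the wall-clock budget and return val-bpb", the config dict stands in for the mutable training script, and a simple softmax bandit over discrete edits stands in for the PPO meta-learner. The baseline value and action space are illustrative, not from the paper.

```python
# Minimal sketch of the AutoResearch-RL outer loop (all names and
# numbers are illustrative stand-ins, not the paper's implementation).
import math
import random

random.seed(0)

# (i) Frozen environment: evaluation protocol and constants are fixed
#     across experiments so rewards remain comparable.
BASELINE_BPB = 1.20  # hypothetical hand-tuned baseline val-bpb

# (ii) Mutable target state: a config dict stands in for train.py.
state = {"lr": 3e-4, "width": 512}

# Discrete action space of "code edits" the agent may propose.
ACTIONS = [("lr", 0.5), ("lr", 2.0), ("width", 0.5), ("width", 2.0)]

def run_experiment(cfg):
    """Toy proxy for 'run train.py, return val-bpb' (lower is better)."""
    # Pretend the optimum is lr=6e-4, width=1024, plus mild eval noise.
    lr_term = abs(math.log(cfg["lr"] / 6e-4))
    w_term = abs(math.log(cfg["width"] / 1024))
    return 1.0 + 0.2 * lr_term + 0.2 * w_term + random.gauss(0, 0.005)

# (iii) Meta-learner: softmax preferences over edits (PPO stand-in).
prefs = [0.0] * len(ACTIONS)

def sample_action():
    weights = [math.exp(p) for p in prefs]
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

best_bpb = run_experiment(state)
for step in range(300):                  # ~300 iterations, as reported
    i = sample_action()
    key, factor = ACTIONS[i]
    trial = dict(state)
    trial[key] = trial[key] * factor     # the proposed "code edit"
    bpb = run_experiment(trial)
    reward = best_bpb - bpb              # improvement over current best
    prefs[i] += 0.5 * reward             # crude policy-gradient step
    if bpb < best_bpb:                   # keep edits that improve val-bpb
        state, best_bpb = trial, bpb

print(f"best val-bpb after search: {best_bpb:.3f}")
```

The real system replaces the bandit update with PPO over a trajectory of experiment outcomes, and the "edit" is an arbitrary modification to train.py rather than a multiplicative tweak to two scalars; the separation of frozen environment, mutable state, and meta-learner is the same.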