🤖 AI Summary
This work proposes a fully automated, open-ended neural architecture discovery framework that removes the reliance on manual design of architectures and hyperparameters. The approach formalizes architecture search as a Markov decision process in which a reinforcement learning agent autonomously modifies training scripts, executes experiments, and explores the search space under a fixed environment and wall-clock time budget, using validation-set bits-per-byte (val-bpb) as the reward signal. By separating three concerns — the frozen environment, the mutable target files, and the meta-learner — and combining PPO optimization with a fixed data pipeline and an editable training-script mechanism, the framework enables continuous optimization without human intervention. On the single-GPU nanochat pretraining benchmark, it discovers configurations that match or surpass manually tuned baselines within approximately 300 iterations.
📝 Abstract
We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
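The outer loop described above can be sketched in miniature. Everything in this example is a hypothetical stand-in: `run_experiment` is a toy proxy for "execute train.py under the wall-clock budget and return val-bpb", the config dict stands in for the mutable training script, and a simple softmax bandit over discrete edits stands in for the PPO meta-learner. The baseline value and action space are illustrative, not from the paper.

```python
# Minimal sketch of the AutoResearch-RL outer loop (all names and
# numbers are illustrative stand-ins, not the paper's implementation).
import math
import random

random.seed(0)

# (i) Frozen environment: evaluation protocol and constants are fixed
#     across experiments so rewards remain comparable.
BASELINE_BPB = 1.20  # hypothetical hand-tuned baseline val-bpb

# (ii) Mutable target state: a config dict stands in for train.py.
state = {"lr": 3e-4, "width": 512}

# Discrete action space of "code edits" the agent may propose.
ACTIONS = [("lr", 0.5), ("lr", 2.0), ("width", 0.5), ("width", 2.0)]

def run_experiment(cfg):
    """Toy proxy for 'run train.py, return val-bpb' (lower is better)."""
    # Pretend the optimum is lr=6e-4, width=1024, plus mild eval noise.
    lr_term = abs(math.log(cfg["lr"] / 6e-4))
    w_term = abs(math.log(cfg["width"] / 1024))
    return 1.0 + 0.2 * lr_term + 0.2 * w_term + random.gauss(0, 0.005)

# (iii) Meta-learner: softmax preferences over edits (PPO stand-in).
prefs = [0.0] * len(ACTIONS)

def sample_action():
    weights = [math.exp(p) for p in prefs]
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

best_bpb = run_experiment(state)
for step in range(300):                  # ~300 iterations, as reported
    i = sample_action()
    key, factor = ACTIONS[i]
    trial = dict(state)
    trial[key] = trial[key] * factor     # the proposed "code edit"
    bpb = run_experiment(trial)
    reward = best_bpb - bpb              # improvement over current best
    prefs[i] += 0.5 * reward             # crude policy-gradient step
    if bpb < best_bpb:                   # keep edits that improve val-bpb
        state, best_bpb = trial, bpb

print(f"best val-bpb after search: {best_bpb:.3f}")
```

The real system replaces the bandit update with PPO over a trajectory of experiment outcomes, and the "edit" is an arbitrary modification to train.py rather than a multiplicative tweak to two scalars; the separation of frozen environment, mutable state, and meta-learner is the same.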