AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a fully automated, open-ended neural architecture discovery framework to remove the reliance on manual design of neural architectures and hyperparameters. The approach formalizes architecture search as a Markov decision process in which a reinforcement learning agent autonomously modifies training scripts, executes experiments, and explores the search space within a fixed environment and time budget, using validation-set bits-per-byte (val-bpb) as the reward signal. By decoupling the responsibilities of the environment, the mutable target files, and the meta-learner, and by combining PPO optimization with a fixed data pipeline and an editable training-script mechanism, the framework enables continuous, human-free optimization. On the single-GPU nanochat pretraining benchmark, it discovers configurations that match or surpass manually tuned baselines within approximately 300 iterations.
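The summary names val-bpb as the reward but does not define it; under the usual convention, it converts the model's summed validation cross-entropy (in nats) into bits per byte of the raw validation text, and a lower value is better, so the reward is its negation. A minimal sketch under that assumption (function names are illustrative, not from the paper):

```python
import math

def val_bpb(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy over a validation set (in nats)
    into bits per byte of the underlying raw text."""
    return total_nats / (total_bytes * math.log(2))

def reward(total_nats: float, total_bytes: int) -> float:
    """Lower bpb means better compression of the validation text,
    so negate it to obtain a scalar reward for the RL agent."""
    return -val_bpb(total_nats, total_bytes)
```

Normalising by bytes rather than tokens makes the metric comparable across experiments that change the tokenizer, which matters here because the agent is free to edit the training script.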

📝 Abstract
We present AutoResearch-RL, a framework in which a reinforcement learning agent conducts open-ended neural architecture and hyperparameter research without human supervision, running perpetually until a termination oracle signals convergence or resource exhaustion. At each step the agent proposes a code modification to a target training script, executes it under a fixed wall-clock time budget, observes a scalar reward derived from validation bits-per-byte (val-bpb), and updates its policy via Proximal Policy Optimisation (PPO). The key design insight is the separation of three concerns: (i) a frozen environment (data pipeline, evaluation protocol, and constants) that guarantees fair cross-experiment comparison; (ii) a mutable target file (train.py) that represents the agent's editable state; and (iii) a meta-learner (the RL agent itself) that accumulates a growing trajectory of experiment outcomes and uses them to inform subsequent proposals. We formalise this as a Markov Decision Process, derive convergence guarantees under mild assumptions, and demonstrate empirically on a single-GPU nanochat pretraining benchmark that AutoResearch-RL discovers configurations that match or exceed hand-tuned baselines after approximately 300 overnight iterations, with no human in the loop.
Problem

Research questions and friction points this paper is trying to address.

neural architecture search
reinforcement learning
autonomous research
hyperparameter optimization
self-evaluating agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

AutoResearch-RL
self-evaluating reinforcement learning
neural architecture discovery
perpetual autonomous research
Proximal Policy Optimisation