🤖 AI Summary
Existing benchmarks for terminal-based agents lack scalable reinforcement learning environments, limiting agents' capacity for autonomous improvement. This work proposes the first fully automated, annotation-free pipeline for generating terminal tasks, comprising four stages: task description synthesis, containerized environment construction, completion-test generation, and solvability filtering—yielding training environments at scale. The approach relies solely on procedural generation, container isolation, binary episodic rewards, and standard PPO, without requiring retrieval mechanisms, multi-agent collaboration, or specialized tools. Experiments demonstrate substantial performance gains on a self-constructed test set (e.g., Qwen2.5-7B improves from 10.7% to 53.3%) and state-of-the-art results on human-curated benchmarks such as TerminalBench 2.0, outperforming more complex agentic architectures.
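The "minimal interaction loop" with binary episodic rewards can be sketched as follows. This is an illustrative assumption of what such a loop looks like, not the paper's actual code: `run_episode`, `toy_policy`, and the toy task are all hypothetical names, and in practice the policy would be an LLM acting inside an isolated container rather than the host shell.

```python
import os
import subprocess
import tempfile

def run_episode(policy, task_prompt, completion_test, max_turns=8):
    """Roll out one episode: the policy emits shell commands, the
    environment returns their output, and the reward is binary
    (1.0 if the completion test passes at episode end, else 0.0)."""
    history = [task_prompt]
    for _ in range(max_turns):
        command = policy("\n".join(history))  # an LLM call in practice
        if command == "DONE":
            break
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        history.append(result.stdout + result.stderr)
    return 1.0 if completion_test() else 0.0

# Toy task and scripted "policy" purely for illustration.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "hello.txt")
script = iter([f"echo hi > {target}", "DONE"])
toy_policy = lambda transcript: next(script)

reward = run_episode(toy_policy, f"Create {target}",
                     lambda: os.path.exists(target))
print(reward)  # 1.0
```

With this shape, PPO only needs the final scalar `reward` per episode; no step-level shaping, retrieval, or tool scaffolding is involved.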
📝 Abstract
Environments are the bottleneck for self-improving agents. Current terminal benchmarks were built for evaluation, not training; reinforcement learning requires a scalable pipeline, not just a dataset. We introduce Endless Terminals, a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. The pipeline has four stages: generating diverse task descriptions, building and validating containerized environments, producing completion tests, and filtering for solvability. From this pipeline we obtain 3255 tasks spanning file operations, log management, data processing, scripting, and database operations. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop: no retrieval, multi-agent coordination, or specialized tools. Despite this simplicity, models trained on Endless Terminals show substantial gains on our held-out dev set: Llama-3.2-3B improves from 4.0% to 18.2%, Qwen2.5-7B from 10.7% to 53.3%, and Qwen3-8B-openthinker-sft from 42.6% to 59.0%. These improvements transfer to held-out, human-curated benchmarks: on TerminalBench 2.0, Llama-3.2-3B improves from 0.0% to 2.2%, Qwen2.5-7B from 2.2% to 3.4%, and Qwen3-8B-openthinker-sft from 1.1% to 6.7%, in each case outperforming alternative approaches including models with more complex agentic scaffolds. These results demonstrate that simple RL succeeds when environments scale.
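The four-stage generation pipeline described above can be sketched as plain control flow. Everything below is a hedged illustration under stated assumptions: in the real pipeline each stage would call an LLM and build a Docker container, whereas here every stage is stubbed with deterministic toy logic (`Task`, `synthesize_descriptions`, `reference_agent`, etc. are hypothetical names, not the paper's API).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    description: str            # stage 1: synthesized task description
    env_setup: str              # stage 2: commands that build the container state
    completion_test: Callable[[Dict[str, str]], bool]  # stage 3: binary check

def synthesize_descriptions(n: int) -> List[str]:
    # Stage 1: in the real pipeline an LLM proposes diverse terminal tasks.
    return [f"create file_{i}.txt containing {i}" for i in range(n)]

def build_environment(desc: str) -> Task:
    # Stage 2: construct an isolated environment for the task (stubbed).
    i = desc.split("file_")[1].split(".")[0]
    # Stage 3: generate a completion test that checks the final state.
    test = lambda fs, i=i: fs.get(f"file_{i}.txt") == i
    return Task(desc, "mkdir -p /workspace", test)

def reference_agent(task: Task) -> Dict[str, str]:
    # Toy "agent": parses the description and writes the requested file.
    i = task.description.split("file_")[1].split(".")[0]
    return {f"file_{i}.txt": i}

def is_solvable(task: Task) -> bool:
    # Stage 4: keep only tasks a reference policy solves at least once.
    return task.completion_test(reference_agent(task))

def generate_tasks(n: int) -> List[Task]:
    tasks = [build_environment(d) for d in synthesize_descriptions(n)]
    return [t for t in tasks if is_solvable(t)]

def episode_reward(task: Task, final_fs: Dict[str, str]) -> float:
    # Binary episode-level reward used for PPO training.
    return 1.0 if task.completion_test(final_fs) else 0.0

tasks = generate_tasks(3)
print(len(tasks), episode_reward(tasks[0], {"file_0.txt": "0"}))  # 3 1.0
```

The design point the sketch mirrors is that solvability filtering (stage 4) is what makes fully automatic generation usable for RL: unsolvable or broken tasks yield uniformly zero reward and would otherwise dilute the training signal.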