🤖 AI Summary
Progress on terminal agents is limited by the scarcity of high-quality training data: existing synthesis methods produce tasks with narrow distributions, environments misaligned with task semantics, and inefficient exploration trajectories. This work proposes Terminal-World, the first framework to use skills as a unified compositional primitive for jointly synthesizing task instructions, environments, and teacher trajectories. By composing skills into skill teams and skill graphs, Terminal-World supports multi-role and cross-domain task synthesis, broadening task diversity and complexity. With this framework, the authors construct 5,723 training environments and train the Terminal-World-8B/14B/32B model series, which consistently outperforms terminal-agent baselines across six benchmarks. Notably, with the same teacher model and only 1.2% of the training data, Terminal-World-32B reaches a Pass@1 of 31.5 on Terminal-Bench 2.0 (+4.5 over Nemotron-Terminal-32B) and a Pass@3 of 43.8.
📝 Abstract
Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources, such as human-defined seeds or GitHub repositories, to instantiate one component and then complete the rest. This produces tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive. Each skill jointly encodes what to accomplish, when to apply it (preconditions and environment state), and how to execute it, so that task instructions, environments, and teacher trajectories can be co-derived from the same source. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B. Across 6 benchmarks, the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (reaching 31.5) and achieves 43.8 Pass@3.
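To make the idea of co-deriving all three artifacts from skills more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the `Skill` fields, the `synthesize` function, and the example skills are all hypothetical names chosen for illustration. It only assumes, as the abstract states, that a skill records what to accomplish, when it applies, and how to execute it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill record; field names are illustrative assumptions."""
    name: str
    goal: str                        # what to accomplish
    preconditions: tuple[str, ...]   # when to apply: required environment state
    commands: tuple[str, ...]        # how to execute
    effects: tuple[str, ...] = ()    # state the skill leaves behind (assumed field)


@dataclass
class SynthesizedTask:
    instruction: str
    environment_setup: list[str]
    teacher_trajectory: list[str]


def synthesize(skills: list[Skill]) -> SynthesizedTask:
    """Derive instruction, environment, and trajectory from one ordered skill chain."""
    provided: set[str] = set()
    setup: list[str] = []
    trajectory: list[str] = []
    for skill in skills:
        for state in skill.preconditions:
            # State not produced by an earlier skill must exist in the environment.
            if state not in provided and state not in setup:
                setup.append(state)
        trajectory.extend(skill.commands)
        provided.update(skill.effects)
    instruction = "; then ".join(skill.goal for skill in skills)
    return SynthesizedTask(instruction, setup, trajectory)


# Two made-up skills chained so that the first one's effect satisfies
# the second one's precondition.
extract = Skill(
    name="extract_archive",
    goal="Unpack logs.tar.gz into ./logs",
    preconditions=("logs.tar.gz exists",),
    commands=("mkdir -p logs", "tar -xzf logs.tar.gz -C logs"),
    effects=("./logs populated",),
)
count = Skill(
    name="count_errors",
    goal="Count lines containing ERROR in ./logs",
    preconditions=("./logs populated",),
    commands=("grep -r ERROR logs | wc -l",),
)

task = synthesize([extract, count])
print(task.instruction)
print(task.environment_setup)
print(task.teacher_trajectory)
```

In this sketch the environment only needs to provide the archive, because the second skill's precondition is met by the first skill's effect. That is one simple way the instruction, environment, and trajectory can stay mutually consistent when they share a source.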