🤖 AI Summary
Existing language agents show limited multi-step planning and state adaptation in terminal environments because their training relies on externally scraped repositories, which offer narrow domain coverage, little environmental control, and limited ability to target specific capability deficits. This work proposes LiteCoder-Terminal-Gen, a zero-dependency pipeline that synthesizes executable and verifiable terminal training environments solely from domain specifications. The pipeline is used to build two datasets: LiteCoder-Terminal-SFT (11,255 expert trajectories across 10 domains) for supervised fine-tuning, and LiteCoder-Terminal-RL (602 verifiable environments) for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on the SFT data yields a 32B model that achieves pass@1 of 29.06%, 18.54%, and 34.00% on Terminal Bench 1.0, 2.0, and Pro, respectively, and applying Direct Multi-turn Preference Optimization (DMPO) on the RL environments yields additional gains. The results support fully synthetic, executable environments as a scalable and verifiable source of supervision.
📝 Abstract
Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.