On Data Engineering for Scaling LLM Terminal Capabilities

📅 2026-02-24

🤖 AI Summary

This work addresses the lack of systematic research and publicly available resources on training data strategies for terminal-capable large language models. The authors propose Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports both seed-based and skill-based task construction, and present a systematic analysis of how data filtering, curriculum learning, long-context training, and data scaling affect performance on terminal tasks. Using this pipeline, they construct Terminal-Corpus, a large-scale open-source dataset used to train the Nemotron-Terminal model family (initialized from Qwen3 at 8B, 14B, and 32B). On Terminal-Bench 2.0, the 32B variant improves from 3.4% to 27.4%, matching the performance of significantly larger models. The model checkpoints and most of the synthetic datasets are publicly released.

📝 Abstract

Despite rapid recent progress in the terminal capabilities of large language models, the training data strategies behind state-of-the-art terminal agents remain largely undisclosed. We address this gap through a systematic study of data engineering practices for terminal agents, making two key contributions: (1) Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports seed-based and skill-based task construction, and (2) a comprehensive analysis of data and training strategies, including filtering, curriculum learning, long-context training, and scaling behavior. Our pipeline yields Terminal-Corpus, a large-scale open-source dataset for terminal tasks. Using this dataset, we train Nemotron-Terminal, a family of models initialized from Qwen3 (8B, 14B, 32B) that achieve substantial gains on Terminal-Bench 2.0: Nemotron-Terminal-8B improves from 2.5% to 13.0%, Nemotron-Terminal-14B improves from 4.0% to 20.2%, and Nemotron-Terminal-32B improves from 3.4% to 27.4%, matching the performance of significantly larger models. To accelerate research in this domain, we open-source our model checkpoints and most of our synthetic datasets at https://huggingface.co/collections/nvidia/nemotron-terminal.
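
To put the reported Terminal-Bench 2.0 numbers side by side, the short sketch below computes the absolute gain (in percentage points) and the multiplicative factor for each model size. The scores are copied from the abstract; the names `REPORTED`, `gains`, and `SUMMARY` are illustrative only, and the sketch assumes the "from" figure in each pair is the score of the corresponding Qwen3 initialization, which the abstract implies but does not state outright.

```python
# Terminal-Bench 2.0 scores (%) copied from the abstract.
# Each pair is (score before, score after); treating "before" as the
# Qwen3 initialization is an assumption, not something stated explicitly.
REPORTED = {
    "8B": (2.5, 13.0),
    "14B": (4.0, 20.2),
    "32B": (3.4, 27.4),
}


def gains(before, after):
    """Return (absolute gain in percentage points, multiplicative factor)."""
    return round(after - before, 1), round(after / before, 1)


# Hypothetical summary table: model size -> (points gained, factor).
SUMMARY = {size: gains(*scores) for size, scores in REPORTED.items()}

for size, (points, factor) in SUMMARY.items():
    print(f"{size}: +{points} points, roughly {factor}x the starting score")
```

Read this way, the absolute gain grows with model size (about 10.5 points at 8B versus 24 points at 32B), which is consistent with the abstract's emphasis on scaling behavior. The multiplicative factors should be treated with caution because the starting scores are all very low.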

Problem

Research questions and friction points this paper is trying to address.

data engineering

large language models

terminal agents

training data strategies

synthetic task generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic task generation

data engineering

terminal agents