TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

📅 2026-05-21

🤖 AI Summary
Scalable, high-fidelity benchmarks for evaluating agents on real-world terminal tasks are lacking. This work proposes TerminalWorld, a benchmark produced by an automated reverse-engineering engine that processes 80,870 real terminal session recordings into 1,530 validated tasks. The tasks span 18 scenario categories and 1,280 unique commands, and a manually reviewed subset of 200 representative tasks forms TerminalWorld-Verified. Unlike expert-crafted benchmarks, TerminalWorld is authentic by construction and can keep pace as developer practices evolve. Across eight frontier models and six agents, the best pass rate on TerminalWorld-Verified is only 62.5%, and scores correlate only weakly with those on existing benchmarks such as Terminal-Bench (Pearson r = 0.20), which suggests that TerminalWorld measures terminal capabilities those benchmarks do not capture.
📝 Abstract
We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
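The two headline numbers in the abstract, a pass rate and a Pearson correlation, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `pass_rate` and `pearson_r` are hypothetical, and it assumes that the pass rate is simply the fraction of tasks passed and that the correlation is taken between per-system scores on two benchmarks. Under that assumption, 62.5% on the 200-task Verified subset would correspond to 125 passed tasks.

```python
from math import sqrt


def pass_rate(results):
    """Fraction of tasks passed, given one boolean per task (assumed definition)."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Illustrative input only: 125 passes out of 200 tasks gives 0.625.
verified_results = [True] * 125 + [False] * 75
print(pass_rate(verified_results))
```

A value of r near 0.20 means that ranking systems by one benchmark says little about their ranking on the other, which is the sense in which the abstract calls the correlation weak.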
Problem

Research questions and friction points this paper is trying to address.

terminal tasks
agent benchmarking
real-world evaluation
command-line workflows
scalable benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

terminal task benchmarking
automated data engine
real-world agent evaluation
reverse-engineered tasks
scalable benchmark