Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

📅 2026-01-17
📈 Citations: 9
Influential: 1
🤖 AI Summary
This work addresses the gap that existing AI agent benchmarks inadequately evaluate performance on realistic, complex, long-horizon command-line tasks. To bridge it, the authors introduce a novel evaluation benchmark of 89 high-difficulty terminal tasks, all derived from authentic workflows and each accompanied by an isolated execution environment, a human-authored reference solution, and automated verification tests. The benchmark is designed for realism, verifiability, and diversity, substantially narrowing the disparity between practical scenarios and current model evaluation paradigms. Experimental results show that even state-of-the-art agents achieve success rates below 65% on the benchmark. The paper further provides a comprehensive error analysis and publicly releases the dataset and evaluation toolchain to support future research in this domain.

📝 Abstract
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
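The task structure the abstract describes (an isolated environment, a human-written reference solution, and automated verification tests that gate success) can be illustrated with a minimal sketch. Everything below is a hypothetical toy, not Terminal-Bench's actual API: the `Task` dataclass, `run_agent`, `verify`, and the dict-of-files "environment state" are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # Hypothetical stand-in for one benchmark task: an instruction,
    # a reference solution, and a set of automated verification checks.
    name: str
    instruction: str
    reference_solution: str                      # human-written shell commands
    checks: list = field(default_factory=list)   # callables: state -> bool

def run_agent(task, agent):
    """Run an agent (here, any callable: instruction -> final state)."""
    return agent(task.instruction)

def verify(task, state):
    """A task counts as solved only if every automated check passes."""
    return all(check(state) for check in task.checks)

# Toy example: the "environment state" is just a dict of file contents.
task = Task(
    name="write-greeting",
    instruction="Create greeting.txt containing 'hello'",
    reference_solution="echo hello > greeting.txt",
    checks=[lambda s: s.get("greeting.txt") == "hello"],
)

perfect_agent = lambda instr: {"greeting.txt": "hello"}
lazy_agent = lambda instr: {}

results = {name: verify(task, run_agent(task, fn))
           for name, fn in [("perfect", perfect_agent), ("lazy", lazy_agent)]}
print(results)  # {'perfect': True, 'lazy': False}
```

Binary, test-gated scoring of this kind is what makes the benchmark automatically verifiable; in the real harness each task additionally runs in its own isolated container rather than a shared process.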
Problem

Research questions and friction points this paper is trying to address:
AI agents, benchmarking, command line interfaces, real-world tasks, long-horizon tasks

Innovation

Methods, ideas, or system contributions that make the work stand out:
Terminal-Bench, AI agents, command-line interface, realistic tasks, benchmarking
👥 Authors
Mike A. Merrill (Postdoc, Stanford University): language models, agents
Alexander G. Shaw (Laude Institute)
Nicholas Carlini (Anthropic)
Boxuan Li (Microsoft): Big Data, LLM, agent
Harsh Raj (Vijil AI): Machine Learning, Natural Language Processing, Generative Networks, Vision-and-Language models
Ivan Bercovich (University of California Santa Barbara): LLMs, Information Retrieval
Lin Shi (Beihang University): Software Engineering
Jeong Yeon Shin (Cornell University, Snorkel AI)
Thomas Walshe (Reflection AI)
E. K. Buchanan (Stanford University)
Junhong Shen (Ph.D. student in Machine Learning, Carnegie Mellon University)
Guanghao Ye (Massachusetts Institute of Technology)
Haowei Lin (Peking University): LLM, AI4Science
Jason Poulos (Independent)
Maoyu Wang (Independent)
Marianna Nezhurina (LAION, JSC, FZJ)
J. Jitsev (LAION, JSC, FZJ)
Di Lu (Tencent)
O. M. Mastromichalakis (National Technical University of Athens, Nerion)
Zhiwei Xu (PhD student, University of Michigan): Machine Learning, Deep Learning, Artificial Intelligence
Zizhao Chen (Cornell University)
Yue Liu (National University of Singapore): Self-Supervised Learning, Large Language Model, Graph Neural Network
Robert Zhang (The University of Texas at Austin): Programming Languages
Leon Liangyu Chen (Stanford University)
Anurag Kashyap (Amazon)
Jan-Lucas Uslu (Stanford University)
Jeffrey Li (University of Washington): Machine Learning
Jianbo Wu (unknown affiliation): Agent Evaluation, Environmental Interaction
Minghao Yan (University of Wisconsin-Madison)
Song Bian (University of Wisconsin-Madison)
Vedang Sharma (Independent)
Ke Sun (Independent)
Steven Dillmann (Stanford University, University of Cambridge): AI for Science, Machine Learning, Data Driven Discovery, Computational Mathematics
Akshay Anand (University of California, Berkeley)
Andrew Lanpouthakoun (Stanford University)
Bardia Koopah (University of California, Berkeley)
Changran Hu (University of California, Berkeley): LLM, long context, Agentic AI, Post Training
E. Guha (Stanford University, University of Washington)
Gabriel H. S. Dreiman (Independent)
Jiacheng Zhu (MIT): Machine Learning, Foundation Models, Optimal Transport, Bayesian modeling
Karl Krauth (Postdoc, Stanford): machine learning, statistics, optimization
Li Zhong (High Performance Computing Center Stuttgart (HLRS)): Big Data, Machine Learning, Deep Learning, HPC
Niklas Muennighoff (Stanford University): large language models, artificial intelligence, machine learning
Robert K. Amanfu (Independent)
Shangyin Tan (PhD Student, UC Berkeley): Program Analysis, Programming Languages, Compilers
Shreyas Pimpalgaonkar (Bespoke Labs)
Tushar Aggarwal (Research Fellow, Microsoft Research): AI4Code, Software Engineering
Xiangning Lin (Carnegie Mellon University)
Xin Lan (Michigan State University)
Xuandong Zhao (UC Berkeley): Machine Learning, Natural Language Processing, AI Safety
Yiqing Liang (Brown University): Multimodal Generative AI
Yuanli Wang (Boston University): Distributed Systems, MLSys, Large Language Models, Agentic AI
Zilong Wang (University of California, San Diego)
Changzhi Zhou (Beijing Institute of Technology)
David Heineman (Allen Institute for AI)
Hange Liu (Independent)
Harsh Trivedi (Allen Institute for AI)
John Yang (Stanford University): Machine Learning, Natural Language Processing, Programming Languages, Software Engineering
Junhong Lin (Massachusetts Institute of Technology)
Manish Shetty (University of California, Berkeley): AI for Code
Michael Yang (University of California, Santa Barbara)
Nabil Omi (University of Washington)
Negin Raoof (University of California, Berkeley)
Shanda Li (Carnegie Mellon University): Machine Learning
Terry Yue Zhuo (Researcher): Large Language Models, Code Generation, AI4SE, Cybersecurity
Wuwei Lin (OpenAI): Machine Learning Systems
Yiwei Dai (Cornell University)
Yuxin Wang (Dartmouth College): NLP, Natural Language Understanding, Knowledge Representation and Reasoning
Wenhao Chai (Princeton University): Machine Learning, Computer Vision
Shang Zhou (PhD Student, University of California, San Diego)
Dariush Wahdany (CISPA)
Ziyu She (University of Basel)
Jiaming Hu (Boston University): Deep Learning, Optimization
Zhikang Dong (Stony Brook University)
Yuxuan Zhu (PhD student, University of Illinois Urbana-Champaign): Data systems, AI evaluation
Sasha Cui (Yale University)
Ahson Saiyed (University of Virginia)
Arinbjörn Kolbeinsson (University of Virginia): Machine Learning, Biomedical Data Science, Tensor methods
Jesse Hu (Abundant)
Christopher Rytting (Laude Institute)
Ryan Marten (Bespoke Labs)
Yixin Wang (University of Michigan): Bayesian statistics, Machine Learning
Alexandros G. Dimakis (University of California, Berkeley, Bespoke Labs)
A. Konwinski (Laude Institute)
Ludwig Schmidt (Stanford University and Anthropic): Machine Learning, Artificial Intelligence, Optimization, Algorithms, Statistics