TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

📅 2024-12-18
🏛️ arXiv.org
📈 Citations: 22
Influential: 1
🤖 AI Summary
This work addresses the challenge of evaluating large language model (LLM)-based agents on realistic workplace tasks. We introduce the first scalable benchmark platform grounded in the “digital worker” behavioral paradigm. The platform simulates a small software company’s digital work environment, supporting long-horizon, multi-step, multi-tool tasks—including web browsing, code generation and execution, and iterative collaborative communication—while enabling closed-loop interaction evaluation. Methodologically, it integrates a web browser, code sandbox, internal messaging system, and simulated enterprise websites, and establishes baseline agents using both proprietary and open-source LLMs. Experiments show that the best-performing agent completes 24% of professional tasks, demonstrating practical utility for short-horizon tasks but revealing substantial limitations in handling long-range, complex workflows. Key contributions include: (1) the first evaluation paradigm explicitly designed for digital workers; (2) a task suite aligned with authentic professional workflows; and (3) an end-to-end, reproducible, closed-loop evaluation framework.
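The closed-loop evaluation described above can be pictured as a loop over tasks whose outcomes are verified against checkpoints, with a task counting as done only when every checkpoint passes. The sketch below is purely illustrative; the class and function names are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    """One verifiable milestone within a task (hypothetical structure)."""
    description: str
    passed: bool = False

@dataclass
class Task:
    name: str
    checkpoints: list  # list[Checkpoint]

    @property
    def completed(self) -> bool:
        # A task counts as autonomously completed only if every checkpoint passes.
        return all(c.passed for c in self.checkpoints)

def completion_rate(tasks) -> float:
    """Fraction of tasks the agent completed fully (e.g., 0.24 for the best agent)."""
    if not tasks:
        return 0.0
    return sum(t.completed for t in tasks) / len(tasks)
```

Checkpoint-style scoring also makes it easy to distinguish agents that make partial progress on long-horizon tasks from those that fail outright.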

📝 Abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal websites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agents' performance on real-world professional tasks
Assessing automation potential of LLM agents in workplace settings
Measuring AI's capability to autonomously complete work-related tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking LLM agents on real-world tasks
Self-contained environment mimicking software company
Testing both closed API-based and open-weights LMs
Frank F. Xu
Carnegie Mellon University
Yufan Song
Carnegie Mellon University
AI Agents, ML System
Boxuan Li
Microsoft
Big Data, LLM, Agent
Yuxuan Tang
Carnegie Mellon University
Kritanjali Jain
Carnegie Mellon University
Mengxue Bao
Carnegie Mellon University
Zora Z. Wang
Carnegie Mellon University
Xuhui Zhou
Carnegie Mellon University
Natural Language Processing
Zhitong Guo
Carnegie Mellon University
Murong Cao
Independent
Mingyang Yang
Independent
Hao Yang Lu
Independent
Amaad Martin
Carnegie Mellon University
Zhe Su
Carnegie Mellon University
Leander Maben
Carnegie Mellon University
Raj Mehta
Carnegie Mellon University
Wayne Chi
CMU
Machine Learning, Computer Science
Lawrence Jang
Carnegie Mellon University
Yiqing Xie
Carnegie Mellon University
Natural Language Processing, Code Generation, Factuality
Shuyan Zhou
Duke University
Large Language Models, AI Agent
Graham Neubig
Carnegie Mellon University, All Hands AI
Natural Language Processing, Machine Learning, Artificial Intelligence