Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

2026-05-05

AI Summary
Existing benchmarks inadequately assess AI agents' ability to manage explicit and implicit dependencies across large-scale, heterogeneous files in realistic work environments. To address this gap, this work introduces Workspace-Bench, a benchmark comprising five worker profiles, 74 file types, and 20,476 files (up to 20 GB), along with 388 tasks, each annotated with its own file dependency graph and evaluated against 7,399 rubrics in total. The benchmark evaluates cross-file retrieval, contextual reasoning, and adaptive decision-making. The work further proposes dependency-graph-based task modeling, a multi-granularity scoring scheme, and a lightweight 100-task subset, Workspace-Bench-Lite, which preserves the benchmark distribution while cutting evaluation costs by about 70%. Across 4 agent harnesses and 7 foundation models, the best agent reaches only 68.7% and the average across agents is 47.4%, both well below the human result of 80.7%, which points to a substantial gap in current AI systems' understanding of complex workspaces.
Abstract
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning invOlving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only 68.7%, substantially below the human result of 80.7%, and the average performance across agents is only 47.4%.
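To make the idea of a task with its own file dependency graph and a set of rubrics more concrete, here is a minimal sketch in Python. It is not the benchmark's actual schema or scoring code: the `Task` class, the `required_files` and `rubric_score` helpers, the file names, and the equal-weight pass fraction are all assumptions introduced purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical task record; the real benchmark format may differ."""
    task_id: str
    # Maps a file to the files it depends on (assumed edge direction).
    dependencies: dict[str, list[str]] = field(default_factory=dict)
    # Rubric identifiers checked by an evaluator (assumed representation).
    rubrics: list[str] = field(default_factory=list)


def required_files(task: Task, targets: list[str]) -> set[str]:
    """Collect every file reachable from the targets via dependency edges."""
    seen: set[str] = set()
    stack = list(targets)
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(task.dependencies.get(current, []))
    return seen


def rubric_score(passed: dict[str, bool], rubrics: list[str]) -> float:
    """Fraction of rubrics marked as passed (equal weighting is an assumption)."""
    if not rubrics:
        return 0.0
    return sum(1 for r in rubrics if passed.get(r, False)) / len(rubrics)


# Illustrative usage with made-up file names and rubric ids.
example = Task(
    task_id="demo-1",
    dependencies={
        "report.docx": ["sales.xlsx", "notes.md"],
        "sales.xlsx": ["raw.csv"],
    },
    rubrics=["r1", "r2", "r3"],
)
files_needed = required_files(example, ["report.docx"])
score = rubric_score({"r1": True, "r2": False, "r3": True}, example.rubrics)
```

The traversal illustrates why implicit dependencies matter: an agent asked to update `report.docx` would also need `raw.csv`, which is only reachable through an intermediate file.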
Problem

Research questions and friction points this paper is trying to address.

workspace learning
file dependencies
AI agents
benchmarking
realistic workspaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

workspace learning
file dependencies
AI agent benchmarking
cross-file reasoning
realistic task evaluation
Authors
Zirui Tang
Shanghai Jiao Tong University
Xuanhe Zhou
Assistant Professor, Shanghai Jiao Tong University
Data Management, Artificial Intelligence
Yumou Liu
Shanghai Jiao Tong University
Linchun Li
Shanghai Jiao Tong University
Weizheng Wang
Hong Kong Polytechnic University
Information Security, Applied Cryptography, Blockchain
Hongzhang Huang
Shanghai Jiao Tong University
Jun Zhou
Shanghai Jiao Tong University
Jiachen Song
Shanghai Jiao Tong University
Shaoli Yu
Shanghai Jiao Tong University
Jinqi Wang
Shanghai Jiao Tong University
Zihang Zhou
Shanghai Jiao Tong University
Hongyi Zhou
Karlsruhe Institute of Technology
reinforcement learning, imitation learning, robotics
Yuting Lv
ByteDance
Jinyang Li
Independent Researcher
Jiashuo Liu
Tsinghua University
Robust Optimization, OOD Generalization, Data-Centric AI
Ruoyu Chen
Institute of Information Engineering, Chinese Academy of Sciences
Explainable AI, Trustworthy AI, Foundation Model
Chunwei Liu
Massachusetts Institute of Technology
Databases, Compound AI Systems, LLM, Data Compression, IoT
GuoLiang Li
Tsinghua University
Jihua Kang
ByteDance
Fan Wu
Professor, Department of Computer Science and Engineering, Shanghai Jiao Tong University
Wireless Networking, Mobile Computing, Algorithmic Game Theory and Its Applications