ProSoftArena: Benchmarking Hierarchical Capabilities of Multimodal Agents in Professional Software Environments

📅 2025-12-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate the true capabilities of multimodal agents in professional software environments. To address this gap, this work introduces the first benchmarking platform specifically designed for assessing multimodal agents in professional software contexts. It proposes a hierarchical capability framework spanning 436 real-world tasks across 13 core applications in six disciplines, supported by an executable environment and a human-in-the-loop evaluation mechanism. The study establishes the first capability hierarchy for agent performance in professional software usage and pioneers a hybrid evaluation paradigm that combines execution-based automatic scoring with human-in-the-loop assessment, filling a critical void in evaluating agents within professional workflows. Experiments reveal that even state-of-the-art agents achieve only a 24.4% success rate on Level-2 tasks and completely fail on Level-3 cross-software workflow tasks, highlighting their significant limitations in complex, multi-application professional work.

📝 Abstract
Multimodal agents are making rapid progress on general computer-use tasks, yet existing benchmarks remain largely confined to browsers and basic desktop applications, falling short of the professional software workflows that dominate real-world scientific and industrial practice. To close this gap, we introduce ProSoftArena, a benchmark and platform designed specifically for evaluating multimodal agents in professional software environments. We establish the first capability hierarchy tailored to agent use of professional software and construct a benchmark of 436 realistic work and research tasks spanning 6 disciplines and 13 core professional applications. To ensure reliable and reproducible assessment, we build an executable real-computer environment with an execution-based evaluation framework and uniquely incorporate a human-in-the-loop evaluation paradigm. Extensive experiments show that even the best-performing agent attains only a 24.4% success rate on L2 tasks and completely fails on L3 multi-software workflows. In-depth analysis further provides valuable insights into current agent limitations and more effective design principles, paving the way toward more capable agents in professional software settings. This project is available at: https://prosoftarena.github.io.
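The hybrid evaluation idea described in the abstract (execution-based automatic scoring with a human-in-the-loop fallback) can be pictured with a minimal sketch. The names below (Task, score_task, success_rate) are hypothetical illustrations under that assumption, not ProSoftArena's actual code or API.

```python
# Hypothetical sketch of a hybrid scoring scheme (not ProSoftArena's actual API):
# each task carries an executable checker run inside the environment, and tasks
# whose outcomes cannot be verified programmatically fall back to human review.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Task:
    task_id: str
    level: int                              # e.g. 1 = single step, 2 = in-app workflow, 3 = cross-software
    checker: Optional[Callable[[], bool]]   # execution-based verifier, or None if not automatable

def score_task(task: Task, human_verdict: Optional[bool] = None) -> bool:
    """Return task success: automatic check if available, otherwise human-in-the-loop."""
    if task.checker is not None:
        return task.checker()               # execution-based automatic scoring
    if human_verdict is None:
        raise ValueError(f"Task {task.task_id} requires a human verdict")
    return human_verdict                    # human-in-the-loop assessment

def success_rate(results: list[bool]) -> float:
    """Percentage of successful tasks, e.g. the reported 24.4% on L2 tasks."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```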
Problem

Research questions and friction points this paper is trying to address.

multimodal agents
professional software
benchmarking
hierarchical capabilities
software workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal agents
professional software benchmarking
capability hierarchy
execution-based evaluation
human-in-the-loop evaluation
👥 Authors
Jiaxin Ai (WHU)
Yukang Feng (NKU)
Fanrui Zhang (USTC)
Jianwen Sun (Software Engineering Application Technology Lab, Huawei, China)
Zizhen Li (NKU)
Chuanhao Li (Shanghai AI Lab)
Yifan Chang (USTC)
Wenxiao Wu (HUST)
Ruoxi Wang (PITT)
Mingliang Zhai (USTB)
Kaipeng Zhang (Shanghai AI Laboratory)