🤖 AI Summary
Existing agent evaluation benchmarks are often confined to simplified environments or short-duration tasks, failing to assess capabilities within authentic professional workflows. This work proposes SaaS-Bench, the first benchmark grounded in 23 deployable Software-as-a-Service (SaaS) systems, encompassing 106 real-world tasks spanning six domains that require long-horizon execution, cross-application coordination, and domain-specific knowledge. Supporting both textual and multimodal inputs, SaaS-Bench introduces weighted validation checkpoints to measure task progress and completion. By systematically integrating real SaaS environments into agent evaluation, this benchmark emphasizes long-term planning, dynamic state tracking, and inter-application orchestration. Experimental results reveal that even the strongest current LLM-based agents achieve an end-to-end success rate below 4%, highlighting critical deficiencies in task planning, state maintenance, context retention, and error recovery.
📝 Abstract
Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess agent capabilities in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure both strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
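To illustrate how weighted verification checkpoints can yield both a strict completion signal and a partial-progress score, here is a minimal Python sketch. It is not the SaaS-Bench implementation: the `Checkpoint` class, the function names, the normalization by total weight, and the example weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One verification checkpoint for a task (hypothetical structure)."""
    name: str      # illustrative label, not taken from the benchmark
    weight: float  # relative importance of this checkpoint
    passed: bool   # whether the agent's final state satisfied it


def partial_progress(checkpoints):
    """Weight-normalized share of passed checkpoints, in [0, 1].

    Assumes weights are non-negative.
    """
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("total checkpoint weight must be positive")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def strict_success(checkpoints):
    """True only if the task has checkpoints and every one of them passed."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Made-up example: an agent that finishes two of three checkpoints.
example = [
    Checkpoint("create_invoice", 1.0, True),
    Checkpoint("sync_to_crm", 2.0, True),
    Checkpoint("notify_customer", 1.0, False),
]
progress = partial_progress(example)   # 3.0 / 4.0
completed = strict_success(example)    # one checkpoint failed
```

Under this assumed scoring, an agent can accumulate substantial partial progress while still counting as a failure on the strict end-to-end metric, which is the distinction the abstract draws between the two measures.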