SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agent evaluation benchmarks are often confined to simplified environments or short-duration tasks, so they fail to assess capabilities within authentic professional workflows. This work proposes SaaS-Bench, a benchmark grounded in 23 deployable Software-as-a-Service (SaaS) systems, encompassing 106 real-world tasks spanning six domains that require long-horizon execution, cross-application coordination, and domain-specific knowledge. Supporting both text-only and multimodal inputs, SaaS-Bench introduces weighted verification checkpoints to measure both partial progress and strict task completion. By systematically integrating real SaaS environments into agent evaluation, the benchmark emphasizes long-horizon planning, dynamic state tracking, and cross-application orchestration. Experimental results show that even the strongest current LLM-based agent completes fewer than 4% of tasks end-to-end, highlighting critical deficiencies in planning, state tracking, cross-application context maintenance, and error recovery.

📝 Abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess agent capabilities in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure both strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. The code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
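
To make the idea of weighted verification checkpoints concrete, the following is a minimal sketch of how such scoring could work. It is not taken from the SaaS-Bench code: the `Checkpoint` type, the field names, the example weights, and the two formulas (partial progress as the weighted fraction of passed checkpoints, strict completion as all checkpoints passing) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    """One verification checkpoint for a task (hypothetical structure)."""
    name: str      # illustrative label, not from the benchmark
    weight: float  # relative importance of this checkpoint
    passed: bool   # whether the agent's final state satisfied it


def partial_progress(checkpoints: Sequence[Checkpoint]) -> float:
    """Weighted fraction of passed checkpoints, in [0, 1] (assumed formula)."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def strict_success(checkpoints: Sequence[Checkpoint]) -> bool:
    """Strict completion: every checkpoint must pass (assumed definition)."""
    return all(c.passed for c in checkpoints)


# Hypothetical task with three checkpoints; the agent misses the last one.
example_task = [
    Checkpoint("record_created", weight=1.0, passed=True),
    Checkpoint("fields_correct", weight=2.0, passed=True),
    Checkpoint("notification_sent", weight=1.0, passed=False),
]

print(partial_progress(example_task))  # 0.75
print(strict_success(example_task))    # False
```

Under these assumptions, an agent can earn substantial partial credit on a task while still counting as a failure under the strict metric, which is one way a benchmark could report an end-to-end completion rate far lower than average progress.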

Problem

Research questions and friction points this paper is trying to address.

Computer-Using Agents

SaaS

professional workflows

benchmark

long-horizon tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

SaaS-Bench

Computer-Using Agents

long-horizon tasks