WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

📅 2024-07-07
🏛️ Neural Information Processing Systems
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses critical bottlenecks—planning fragmentation, context forgetting, and cross-step reasoning failures—that hinder large language models (LLMs) in enterprise-scale autonomous agent tasks. To this end, we introduce WorkArena++, the first benchmark explicitly designed around realistic knowledge-worker workflows, comprising 682 office-scenario tasks. Methodologically, we propose an end-to-end web interaction framework that integrates LLMs and vision-language models (VLMs), programmatically synthesized tasks, and human behavioral annotations to enable high-fidelity observation-action trajectory generation. Our contributions are threefold: (1) the first systematic evaluation of agent capabilities across planning, logical/arithmetic reasoning, information retrieval, and contextual understanding; (2) an empirical demonstration of fundamental limitations of state-of-the-art models in complex, multi-step workflows; and (3) the open-sourcing of the benchmark and over two thousand ground-truth trajectories, substantially improving fine-tuning efficiency and experimental reproducibility.

📝 Abstract
The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges such models face in serving as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena.
Problem

Research questions and friction points this paper is trying to address.

Evaluating autonomous agents' task-solving abilities in realistic settings.
Assessing planning and reasoning across multi-step enterprise workflows.
Generating ground-truth observation/action data for model fine-tuning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based autonomous agents
Compositional planning and reasoning
Ground-truth observation/action traces