ClawBench: Can AI Agents Complete Everyday Online Tasks?

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks struggle to evaluate the ability of AI agents to perform multi-step everyday tasks—such as shopping, booking appointments, or applying for jobs—in real-world online environments. To address this gap, this work proposes ClawBench, a large-scale evaluation framework built on 144 live production websites and encompassing 153 realistic tasks across 15 categories. The tasks demand parsing user-provided documents, navigating multi-step workflows across diverse platforms, and filling out detailed forms, while a lightweight request-interception layer blocks only the final submission, enabling end-to-end testing on live sites without real-world side effects. Evaluation of seven state-of-the-art models reveals that even the best-performing agent, Claude Sonnet 4.6, achieves only a 33.3% success rate, highlighting substantial limitations in current AI agents’ capabilities for dynamic interaction and extended procedural reasoning.
📝 Abstract
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks demand capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small fraction of these tasks; for example, Claude Sonnet 4.6 achieves only a 33.3% success rate. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
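The interception idea described above can be illustrated with a minimal, self-contained sketch: allow all traffic through, but capture and block only the final submission-like request. This is an assumption-laden illustration, not the authors' implementation; the request shape and the `/submit` endpoint pattern are hypothetical.

```python
# Illustrative sketch of a submission-blocking interception policy
# (hypothetical; not ClawBench's actual code). A request here is a dict
# with 'method', 'url', and 'body'; real systems would hook browser-level
# request routing instead.

SUBMIT_METHODS = {"POST", "PUT"}

def intercept(request, captured):
    """Return 'abort' for submission-like requests (recording them for
    grading) and 'continue' for everything else."""
    is_submission = (
        request["method"] in SUBMIT_METHODS
        and "/submit" in request["url"]  # assumed endpoint pattern
    )
    if is_submission:
        captured.append(request)  # keep the payload so success can be judged
        return "abort"            # the request never reaches the live site
    return "continue"

captured = []
intercept({"method": "GET", "url": "https://shop.example/cart", "body": None}, captured)
intercept({"method": "POST", "url": "https://shop.example/submit", "body": "name=A"}, captured)
```

In a real deployment this decision function would sit behind per-site request routing (e.g., a browser automation framework's network hooks), with the submission pattern defined per task rather than hard-coded.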
Problem

Research questions and friction points this paper is trying to address.

AI agents · everyday online tasks · real-world web interaction · evaluation benchmark · multi-step workflows
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI agents · real-world evaluation · web automation · dynamic environments · safe interaction