🤖 AI Summary
Current web agents lack reliable evaluation on complex, real-world tasks. This paper introduces BEARCUBS, a benchmark of search, browsing, and factual-identification tasks that require real-time web access and multimodal interaction, including visual, video, and 3D content navigated via a virtual keyboard and mouse. It comprises 111 information-seeking questions, each paired with a human-verified browsing trajectory and a ground-truth answer, and it rules out text-only shortcuts by requiring genuinely multimodal interaction. Its key contributions are: (1) a multimodal evaluation setting grounded in live web content rather than synthetic or simulated pages; and (2) a human trajectory annotation protocol with periodic benchmark updates to replace invalid or contaminated questions. Experiments show 84.7% human accuracy, while the best AI system (OpenAI's Operator) reaches only 24.3%, exposing two critical bottlenecks: unreliable source selection and weak multimodal understanding. BEARCUBS thus offers a challenging, transparent evaluation standard for web-based autonomous agents.
📝 Abstract
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
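Because each BEARCUBS question has a short, unambiguous gold answer, accuracy can be scored with a simple normalized exact-match comparison. The sketch below is an assumption for illustration (the paper does not publish this scorer); the `normalize`, `accuracy`, and the toy question IDs and answers are all hypothetical.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    answer = answer.lower()
    answer = re.sub(r"[^\w\s]", "", answer)
    return " ".join(answer.split())

def accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions whose predicted answer exactly matches
    the gold answer after normalization."""
    correct = sum(
        normalize(predictions.get(qid, "")) == normalize(ans)
        for qid, ans in gold.items()
    )
    return correct / len(gold)

# Toy example with made-up question IDs and answers:
gold = {"q1": "Mount Everest", "q2": "1969"}
preds = {"q1": "mount everest.", "q2": "1970"}
print(accuracy(preds, gold))  # 0.5
```

Normalization makes the match robust to trivial formatting differences (case, punctuation, spacing) while still requiring the agent to produce the exact fact the question asks for.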