Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of web agents are largely confined to short, single-site tasks and fail to capture real-world workflows that span multiple sites and extended timeframes. To address this limitation, this work introduces Odysseys, a benchmark of 200 long-horizon web tasks derived from real user browsing sessions and designed to assess agents' sustained contextual understanding and cross-site reasoning in an open-web environment. It proposes a fine-grained scoring mechanism that replaces binary success judgments with human-annotated rubrics (6.1 per task on average) and introduces a Trajectory Efficiency metric that measures rubric score per action step. Evaluation combines LLM-as-a-judge validation with end-to-end execution testing. Results show that even the strongest existing models achieve only a 44.5% task success rate and a Trajectory Efficiency of 1.15%, leaving substantial room for improvement on complex, long-horizon web tasks.

📝 Abstract

Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest achieve a success rate of 44.5%, which leaves substantial room for future improvement. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can operate productively for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
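
The abstract describes Trajectory Efficiency only as "rubric score per step", so the following Python sketch is one plausible reading and not the authors' implementation. It assumes each rubric is graded as simply satisfied or not and that all rubrics carry equal weight; the paper's rubrics may instead be weighted or allow partial credit. The names `RubricResult`, `rubric_score`, and `trajectory_efficiency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """One graded rubric for a task (hypothetical structure)."""
    description: str
    satisfied: bool


def rubric_score(results: list[RubricResult]) -> float:
    """Fraction of rubrics satisfied, in [0, 1].

    Assumption: equal weights and binary grading per rubric.
    """
    if not results:
        raise ValueError("a task needs at least one rubric")
    return sum(r.satisfied for r in results) / len(results)


def trajectory_efficiency(score: float, num_steps: int) -> float:
    """Rubric score earned per action step (assumed definition)."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return score / num_steps


# Illustrative values only, not data from the paper.
results = [
    RubricResult("visited both retailer sites", True),
    RubricResult("extracted the price from each site", True),
    RubricResult("compared shipping costs", True),
    RubricResult("checked return policies", False),
    RubricResult("reported the cheaper option", False),
    RubricResult("cited source URLs", False),
]
score = rubric_score(results)
efficiency = trajectory_efficiency(score, num_steps=50)
print(f"rubric score: {score:.2%}, efficiency per step: {efficiency:.2%}")
```

Under this reading, an agent that satisfies half of a task's rubrics in 50 steps earns 1% of the rubric score per step, which illustrates why long trajectories can yield low efficiency even when partial progress is made.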

Problem

Research questions and friction points this paper is trying to address.

web agents

long-horizon tasks

benchmarking

cross-site reasoning

realistic evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon tasks

rubric-based evaluation

web agents