WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM-based web agents are evaluated on dynamic benchmarks that suffer from instability and simulation artifacts. This work introduces WebArXiv, a static, time-invariant benchmark built from real arXiv webpage snapshots, featuring 275 tasks with deterministic ground-truth annotations and standardized action-trajectory evaluation. To improve agent reasoning, the authors propose a dynamic reflection mechanism that lets agents selectively retrieve salient historical steps, mitigating over-reliance on fixed interaction histories and improving decision adaptability. A fine-grained behavioral failure-mode analysis across ten state-of-the-art agents demonstrates WebArXiv's strong discriminative power, and empirical results show that the reflection mechanism significantly improves task success rates. WebArXiv thus provides a reliable, reproducible basis for evaluating and advancing multimodal web agents.

📝 Abstract
Recent progress in large language models (LLMs) has enabled the development of autonomous web agents capable of navigating and interacting with real websites. However, evaluating such agents remains challenging due to the instability and inconsistency of existing benchmarks, which often rely on dynamic content or oversimplified simulations. In this work, we introduce WebArXiv, a static and time-invariant benchmark comprising 275 web-based tasks grounded in the arXiv platform. WebArXiv ensures reproducible and reliable evaluation by anchoring tasks in fixed web snapshots with deterministic ground truths and standardized action trajectories. Through behavioral analysis, we identify a common failure mode, Rigid History Reflection, where agents over-rely on fixed interaction histories. To address this, we propose a lightweight dynamic reflection mechanism that allows agents to selectively retrieve relevant past steps during decision-making. We evaluate ten state-of-the-art web agents on WebArXiv. Results demonstrate clear performance differences across agents and validate the effectiveness of our proposed reflection strategy.
Problem

Research questions and friction points this paper is trying to address.

Evaluating web agents on dynamic benchmarks is unreliable
Existing benchmarks lack static, reproducible web-based tasks
Agents over-rely on fixed interaction histories (Rigid History Reflection)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Static time-invariant benchmark for web agents
Dynamic reflection mechanism for decision-making
Fixed web snapshots ensure reproducible evaluation
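The dynamic reflection idea above can be sketched in a few lines: rather than replaying the agent's full fixed interaction history at every step, score past steps for relevance to the current observation and keep only the most salient ones. This is a minimal illustrative sketch, not the paper's implementation; the function names and the token-overlap scorer are assumptions.

```python
# Hypothetical sketch of dynamic reflection: select the past steps most
# relevant to the current observation instead of replaying the whole
# fixed history. The Jaccard token-overlap scorer is a stand-in for
# whatever relevance measure the agent actually uses.

def token_overlap(a: str, b: str) -> float:
    """Crude relevance score: Jaccard overlap of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_relevant_steps(history: list[str], observation: str, k: int = 3) -> list[str]:
    """Return the k past steps most relevant to `observation`,
    preserved in their original chronological order."""
    ranked = sorted(range(len(history)),
                    key=lambda i: token_overlap(history[i], observation),
                    reverse=True)
    return [history[i] for i in sorted(ranked[:k])]

history = [
    "clicked search box on arxiv.org",
    "typed query 'multimodal agents'",
    "opened abstract page for paper 2507.00001",
    "scrolled to references section",
]
obs = "current page: arxiv abstract page, need authors of paper 2507.00001"
print(select_relevant_steps(history, obs, k=2))
```

Only the selected steps would then be placed in the agent's prompt, which is what distinguishes this from the rigid fixed-history reflection the paper identifies as a failure mode.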