🤖 AI Summary
Existing benchmarks lack fine-grained evaluation of multimodal AI agents on web-navigation subtasks, such as date selection and scroll positioning. This paper introduces WARC-Bench, a novel benchmark explicitly designed for GUI-level subtask evaluation. Built on Web ARChive (WARC) files, it provides a reproducible, sandboxed, and dynamic web-interaction environment that faithfully reconstructs real-world interface behaviors. Methodologically, the authors initialize models via supervised fine-tuning (SFT) and apply reinforcement learning with verifiable rewards (RLVR) to mitigate data scarcity. Experiments show that the best-performing model achieves a success rate of 64.8%; RLVR improves the SFT baseline from 48.8% to 52.8%, outperforming several state-of-the-art models. This work fills a critical gap in fine-grained web-interaction evaluation and establishes a new paradigm for assessing the controllable, low-level operational capabilities of multimodal agents.
📄 Abstract
Training web agents to navigate complex, real-world websites requires them to master *subtasks*: short-horizon interactions with multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.
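To illustrate the kind of data the sandbox replays, here is a minimal sketch of parsing one WARC record (version line, named headers, then a `Content-Length`-delimited payload). This is not the paper's harness; the record bytes and the `parse_warc_record` helper are hypothetical examples, shown only to make the WARC format concrete.

```python
# Hypothetical single WARC record, hand-written for illustration.
RAW = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

def parse_warc_record(raw: bytes):
    """Split one WARC record into (version, headers dict, payload bytes)."""
    # The header block ends at the first blank line (CRLF CRLF).
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values may themselves contain ":".
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    # The payload is exactly Content-Length bytes of archived content.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]

version, headers, payload = parse_warc_record(RAW)
print(headers["WARC-Target-URI"])  # https://example.com/
```

In practice a library such as `warcio` would be used to iterate over full archives; the point here is only that each record pairs a target URI with the exact bytes needed to reconstruct the page offline.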