WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

📅 2026-04-13

🤖 AI Summary
Existing browser agent benchmarks struggle to achieve realism, reproducibility, and scalability at the same time. This work proposes WebForge, a fully automated framework that builds interactive, self-contained web environments end-to-end through a four-agent pipeline (Plan, Generate, Refine, and Validate) with no human annotation. WebForge introduces a seven-dimensional difficulty control framework covering navigation depth, visual complexity, reasoning difficulty, and other dimensions. Using this framework, the authors construct WebForge-Bench, a benchmark of 934 tasks spanning seven domains and three difficulty levels. The authors present WebForge as the first fully automated framework to resolve the realism-reproducibility-scalability trilemma. Multi-model experiments show that difficulty stratification differentiates model capabilities, and cross-domain analysis exposes capability biases that a single aggregate score cannot capture.

📝 Abstract
Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.
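The contrast the abstract draws between a single aggregate score and a multi-dimensional profile can be illustrated with a small sketch. This is not the WebForge evaluation code: the result records, domain names, and helper functions below are hypothetical and chosen only to show how per-domain and per-difficulty breakdowns can diverge from the overall success rate.

```python
from collections import defaultdict

# Hypothetical per-task results: (domain, difficulty level, success flag).
# Domain names and outcomes are invented for illustration only.
results = [
    ("shopping", 1, True),
    ("shopping", 2, True),
    ("shopping", 3, False),
    ("travel", 1, True),
    ("travel", 2, False),
    ("travel", 3, False),
]


def success_rate(rows):
    """Fraction of rows whose success flag is True."""
    return sum(1 for _, _, ok in rows if ok) / len(rows)


def profile(rows, key_index):
    """Group rows by one field (0 = domain, 1 = difficulty) and score each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return {key: success_rate(group) for key, group in groups.items()}


aggregate = success_rate(results)   # one number for the whole run
by_domain = profile(results, 0)     # success rate per domain
by_level = profile(results, 1)      # success rate per difficulty level
```

In this toy data the aggregate success rate is 0.5, yet the per-level breakdown ranges from 1.0 at level 1 to 0.0 at level 3, and the two domains differ by a factor of two. That spread is the kind of information a single score hides.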
Problem

Research questions and friction points this paper is trying to address.

realism
reproducibility
scalability
browser agent benchmark
trilemma
Innovation

Methods, ideas, or system contributions that make the work stand out.

browser agent benchmark
realism-reproducibility-scalability trilemma
automated web environment generation
multi-dimensional difficulty control
capability profiling