Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

📅 2025-06-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks inadequately support open-ended, dynamic, and long-horizon agentic search due to restrictive assumptions: short search scopes, static answers, and no source-attribution requirement. This paper introduces Mind2Web 2, the first long-horizon benchmark tailored to large language model–driven autonomous web browsing and information synthesis, comprising 130 real-world tasks. We propose a novel *Agent-as-a-Judge* evaluation framework: task-specific judge agents, constructed from tree-structured rubrics, jointly assess answer correctness and source traceability. The benchmark integrates a realistic web-interaction environment and enables cross-system comparison. Among the nine state-of-the-art systems evaluated, OpenAI's Deep Research achieves 50–70% of human performance while halving execution time. This work establishes the first automated, long-horizon, and attributable evaluation paradigm for agentic search.

📝 Abstract
Agentic search such as Deep Research systems, where large language models autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of nine frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50–70% of human performance while spending half the time, showing great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex agentic web search systems
Assessing long-horizon dynamic information synthesis
Automating correctness and source attribution judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-as-a-Judge framework for evaluation
Tree-structured rubric for answer assessment
Real-time web browsing benchmark tasks
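The tree-structured rubric idea above can be illustrated with a minimal sketch. This is a hypothetical simplification: it assumes binary leaf checks and plain averaging at internal nodes, whereas the paper's actual judge agents perform LLM-driven verification and task-specific aggregation. The `RubricNode` class and the example rubric items are illustrative, not from the paper.

```python
# Sketch of tree-structured rubric scoring: leaves are binary checks,
# internal nodes aggregate their children's partial scores.
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    name: str
    children: list["RubricNode"] = field(default_factory=list)
    satisfied: bool = False  # set by a judge check at leaf nodes

    def score(self) -> float:
        # Leaf node: a binary check (e.g., "answer cites a live source URL").
        if not self.children:
            return 1.0 if self.satisfied else 0.0
        # Internal node: average the children's scores (simplified aggregation).
        return sum(c.score() for c in self.children) / len(self.children)


# Hypothetical rubric jointly covering correctness and source attribution.
root = RubricNode("task", children=[
    RubricNode("correctness", children=[
        RubricNode("fact A verified", satisfied=True),
        RubricNode("fact B verified", satisfied=False),
    ]),
    RubricNode("attribution", children=[
        RubricNode("cited URL supports fact A", satisfied=True),
    ]),
])
print(root.score())  # → 0.75
```

Scoring per subtree rather than pass/fail on the whole answer is what lets a judge give partial credit on long-horizon tasks where an agent gets some facts right and others wrong.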
Boyu Gou
The Ohio State University
Artificial Intelligence · Language Agents · GUI Agents
Zanming Huang
The Ohio State University
Machine Learning · Computer Vision
Yuting Ning
The Ohio State University
Natural Language Processing
Yu Gu
The Ohio State University
Michael Lin
Glaucoma Specialist, Massachusetts Eye and Ear
Glaucoma
Weijian Qi
The Ohio State University
Andrei Kopanev
The Ohio State University
Botao Yu
PhD student, Ohio State University
AI for Science · NLP · AI Music
Bernal Jiménez Gutiérrez
The Ohio State University
Yiheng Shu
PhD student, The Ohio State University
Computational Linguistics · Semantic Parsing
Chan Hee Song
The Ohio State University
Jiaman Wu
The Ohio State University
Shijie Chen
PhD Student, The Ohio State University
Natural Language Processing · Machine Learning
Hanane Nour Moussa
The Ohio State University
Tianshu Zhang
The Ohio State University
Jian Xie
The Ohio State University
Yifei Li
The Ohio State University
Tianci Xue
The Ohio State University
NLP
Zeyi Liao
The Ohio State University
AI · NLP · Multimodal · Agent
Kai Zhang
The Ohio State University
Boyuan Zheng
The Ohio State University
Zhaowei Cai
Amazon Artificial General Intelligence
Artificial Intelligence · Computer Vision · Machine Learning
Viktor Rozgic
Amazon AGI
Morteza Ziyadi
Amazon AGI
Huan Sun
Endowed CoE Innovation Scholar and Associate Professor, The Ohio State University
Agents · Large Language Models · Natural Language Processing · AI