InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RAG benchmarks—relying on static corpora, fixed queries, and gold-document evaluation—fail to capture the dynamic information-seeking capabilities of agentic RAG in open-web environments. Method: InfoDeepSeek is a benchmark of challenging questions for assessing agentic information seeking in real-world, dynamic web environments. It contributes a systematic question-construction methodology balancing determinacy, difficulty, and diversity, together with the first evaluation framework tailored to dynamic agentic information seeking, featuring fine-grained metrics for the accuracy, utility, and compactness of information-seeking outcomes. Results: Extensive experiments across diverse LLMs, search engines, and question types reveal nuanced differences in agent behavior, establish reproducible baselines, and yield actionable insights for optimizing agentic RAG systems.

📝 Abstract
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into information seeking. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.
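The abstract names three fine-grained metrics for information-seeking outcomes: accuracy, utility, and compactness. A minimal sketch of how such trajectory-level scores might be computed is shown below; the function name, inputs, and formulas are illustrative assumptions, not the paper's actual metric definitions.

```python
def evaluate_trajectory(answer_correct, retrieved_docs, useful_docs):
    """Toy scoring of one agent search trajectory (illustrative only).

    accuracy    - did the final answer match the reference (0 or 1)
    utility     - share of the needed evidence the agent actually found
    compactness - share of retrieved pages that carried useful evidence
    """
    accuracy = 1.0 if answer_correct else 0.0
    found = useful_docs & set(retrieved_docs)
    utility = len(found) / len(useful_docs) if useful_docs else 0.0
    compactness = len(found) / len(retrieved_docs) if retrieved_docs else 0.0
    return {"accuracy": accuracy, "utility": utility, "compactness": compactness}

# An agent that finds all needed evidence but retrieves redundant pages
# scores full utility and reduced compactness.
scores = evaluate_trajectory(
    answer_correct=True,
    retrieved_docs=["page_a", "page_b", "page_c", "page_d"],
    useful_docs={"page_a", "page_c"},
)
print(scores)
```

Under this sketch, accuracy rewards correct answers, utility rewards evidence coverage, and compactness penalizes retrieving pages that contribute nothing, which is one plausible way to separate "found the answer" from "found it efficiently."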
Problem

Research questions and friction points this paper is trying to address.

Evaluating agentic RAG in dynamic web environments
Lack of benchmarks for autonomous LLM agents
Assessing information seeking accuracy and utility
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces a benchmark for agentic information seeking on the dynamic web
Develops challenging queries satisfying determinacy, difficulty, and diversity
Creates evaluation framework for dynamic agentic seeking
Yunjia Xi
Shanghai Jiao Tong University
LLMs · Agents · Recommendation
Jianghao Lin
Shanghai Jiao Tong University
Large Language Models · AI Agents · Recommender Systems
Menghui Zhu
Huawei Noah’s Ark Lab
Yongzhao Xiao
Shanghai Jiao Tong University
Zhuoying Ou
Shanghai Jiao Tong University
Jiaqi Liu
Shanghai Jiao Tong University
Tong Wan
Shanghai Jiao Tong University
Bo Chen
Huawei Noah’s Ark Lab
Weiwen Liu
Associate Professor, Shanghai Jiao Tong University
large language models · AI agents · recommender systems
Yasheng Wang
Tencent
Natural Language Processing
Ruiming Tang
Huawei Noah’s Ark Lab
Weinan Zhang
Shanghai Jiao Tong University
Yong Yu
Materials Engineer
Polymer matrix composite · adhesive · modeling · test development