LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing deep research benchmarks suffer from narrow domain coverage, ambiguous task definitions, and static, outdated formulations, failing to satisfy four core principles: user-centricity, dynamism, clarity, and multi-faceted, search-intensive design. To address this, we introduce LiveResearchBench, a dynamic benchmark of 100 expert-curated, temporally grounded deep research tasks oriented to real-world information needs and spanning daily life, enterprise, and academic domains; systems must generate citation-grounded, long-form reports using live web search. We also introduce DeepEval, an evaluation suite combining four complementary protocols to assess both content- and report-level quality, including coverage, presentation, citation accuracy and association, and consistency and depth of analysis, with high agreement with human judgments. Evaluating 17 frontier systems reveals critical weaknesses, including poor source grounding and shallow synthesis, and identifies key system components needed to advance reliable, insightful deep research.

📝 Abstract
Deep research -- producing comprehensive, citation-grounded reports by searching and synthesizing information from hundreds of live web sources -- marks an important frontier for agentic systems. To rigorously evaluate this ability, four principles are essential: tasks should be (1) user-centric, reflecting realistic information needs, (2) dynamic, requiring up-to-date information beyond parametric knowledge, (3) unambiguous, ensuring consistent interpretation across users, and (4) multi-faceted and search-intensive, requiring search over numerous web sources and in-depth analysis. Existing benchmarks fall short of these principles, often focusing on narrow domains or posing ambiguous questions that hinder fair comparison. Guided by these principles, we introduce LiveResearchBench, a benchmark of 100 expert-curated tasks spanning daily life, enterprise, and academia, each requiring extensive, dynamic, real-time web search and synthesis. Built with over 1,500 hours of human labor, LiveResearchBench provides a rigorous basis for systematic evaluation. To evaluate citation-grounded long-form reports, we introduce DeepEval, a comprehensive suite covering both content- and report-level quality, including coverage, presentation, citation accuracy and association, consistency and depth of analysis. DeepEval integrates four complementary evaluation protocols, each designed to ensure stable assessment and high agreement with human judgments. Using LiveResearchBench and DeepEval, we conduct a comprehensive evaluation of 17 frontier deep research systems, including single-agent web search, single-agent deep research, and multi-agent systems. Our analysis reveals current strengths, recurring failure modes, and key system components needed to advance reliable, insightful deep research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating comprehensive, citation-grounded reports synthesized from live web sources (a toy grounding check is sketched after this list)
Assessing dynamic information synthesis beyond parametric knowledge
Measuring multi-faceted, search-intensive analysis across diverse domains
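
The first friction point is concrete enough to sketch. Below is a minimal, hypothetical grounding check in Python: it treats a report as (claim, cited URL) pairs and approximates support by token overlap between the claim and the fetched page. The function names, the overlap heuristic, and the 0.6 threshold are all illustrative assumptions, not the paper's DeepEval protocol, which this card describes only at a high level.

```python
import re
import urllib.request
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text nodes from an HTML page (crude, for illustration)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def fetch_page_text(url: str, timeout: float = 10.0) -> str:
    """Download a cited page and strip its markup."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def claim_supported(claim: str, source_text: str, threshold: float = 0.6) -> bool:
    """Crude support proxy: fraction of the claim's tokens found in the source.
    A real protocol would use an entailment model or LLM judge instead."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    if not claim_tokens:
        return False
    source_tokens = set(re.findall(r"[a-z0-9]+", source_text.lower()))
    return len(claim_tokens & source_tokens) / len(claim_tokens) >= threshold

def citation_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Share of (claim, cited_url) pairs whose page appears to support the claim."""
    supported = 0
    for claim, url in pairs:
        try:
            page_text = fetch_page_text(url)
        except Exception:
            continue  # dead or unreachable citation counts as unsupported
        if claim_supported(claim, page_text):
            supported += 1
    return supported / len(pairs) if pairs else 0.0
```

A production protocol would replace claim_supported with a stronger judge; the sketch only fixes the interface: per-claim verification against the cited page, aggregated into a report-level citation-accuracy score.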
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live benchmark with expert-curated tasks
DeepEval suite for comprehensive report evaluation
Multi-protocol assessment ensuring alignment with human judgment (see the aggregation sketch after this list)
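
To make the multi-protocol idea concrete, here is a minimal sketch, assuming four per-report protocol scores in [0, 1], an unweighted mean as the aggregation rule, and Pearson correlation against human ratings as the alignment check. The protocol names and toy numbers are hypothetical; the abstract does not specify DeepEval's actual weighting or agreement metric.

```python
# Illustrative aggregation of per-protocol scores plus a human-agreement check.
# Protocol names, the plain-mean aggregation, and Pearson correlation are
# assumptions for this sketch, not DeepEval's documented design.
from statistics import correlation, mean  # correlation requires Python 3.10+

PROTOCOLS = ("coverage", "presentation", "citation", "depth")  # hypothetical

def report_score(scores: dict[str, float]) -> float:
    """Aggregate one report's per-protocol scores (placeholder: plain mean)."""
    return mean(scores[p] for p in PROTOCOLS)

def human_agreement(auto: list[float], human: list[float]) -> float:
    """Pearson correlation between automated and human scores across reports;
    values near 1.0 mean the automated suite tracks human judgment."""
    return correlation(auto, human)

if __name__ == "__main__":
    # Toy example: three reports scored by the suite and by annotators.
    auto = [
        report_score({"coverage": 0.8, "presentation": 0.7, "citation": 0.9, "depth": 0.6}),
        report_score({"coverage": 0.5, "presentation": 0.6, "citation": 0.4, "depth": 0.5}),
        report_score({"coverage": 0.9, "presentation": 0.8, "citation": 0.85, "depth": 0.9}),
    ]
    human = [0.75, 0.50, 0.90]
    print(f"agreement with humans: {human_agreement(auto, human):.2f}")
```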