DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods fail to capture the intrinsic complexity of generative research synthesis: QA benchmarks emphasize short factual answers, while expert-curated datasets suffer from staleness and data contamination. To address this, we introduce DeepScholar-bench, a live benchmark tailored for generative research synthesis, constructed from recent arXiv papers to enable end-to-end evaluation of retrieval, synthesis, and verifiability. We propose a three-dimensional automated evaluation framework assessing knowledge synthesis, retrieval quality, and verifiability, and release DeepScholar-base, an efficient reference pipeline built with the LOTUS API. Experiments reveal that no state-of-the-art system exceeds a score of 19% across all metrics, confirming that DeepScholar-bench is difficult and far from saturated. DeepScholar-bench establishes a reliable, live, and reproducible evaluation paradigm for the field.

📝 Abstract
The ability to research and synthesize knowledge is central to human expertise and progress. An emerging class of systems promises these exciting capabilities through generative research synthesis, performing retrieval over the live web and synthesizing discovered sources into long-form, cited summaries. However, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short-form factual responses, while expert-curated datasets risk staleness and data contamination. Both fail to capture the complexity and evolving nature of real research synthesis tasks. In this work, we introduce DeepScholar-bench, a live benchmark and holistic, automated evaluation framework designed to evaluate generative research synthesis. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research. Our evaluation framework holistically assesses performance across three key dimensions: knowledge synthesis, retrieval quality, and verifiability. We also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API. Using the DeepScholar-bench framework, we perform a systematic evaluation of prior open-source systems, search AIs, OpenAI's DeepResearch, and DeepScholar-base. We find that DeepScholar-base establishes a strong baseline, attaining performance competitive with or higher than each other method. We also find that DeepScholar-bench remains far from saturated, with no system exceeding a score of 19% across all metrics. These results underscore the difficulty of DeepScholar-bench, as well as its importance for progress towards AI systems capable of generative research synthesis. We make our code available at https://github.com/guestrin-lab/deepscholar-bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the complex capabilities of generative research synthesis systems
Assessing AI systems on the knowledge synthesis and verifiability dimensions
Measuring retrieval quality in live research synthesis tasks (a hypothetical scoring sketch follows this list)
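The abstract reports that no system exceeds a score of 19% across all metrics, which suggests a system is only as good as its weakest dimension. Below is a minimal, hypothetical Python sketch of such an aggregation; the field names, the min-based rule, and the 0.19 threshold are assumptions drawn from that phrasing, not the paper's actual scoring code.

```python
from dataclasses import dataclass


@dataclass
class SynthesisScores:
    """Per-dimension scores in [0, 1]; field names are hypothetical."""
    knowledge_synthesis: float  # quality of the synthesized related-work text
    retrieval_quality: float    # overlap of retrieved sources with the paper's actual citations
    verifiability: float        # fraction of generated claims supported by their cited sources


def clears_threshold(scores: SynthesisScores, threshold: float = 0.19) -> bool:
    # Assumed reading of "no system exceeds 19% across all metrics":
    # the binding score is the weakest dimension.
    return min(scores.knowledge_synthesis,
               scores.retrieval_quality,
               scores.verifiability) > threshold


# Example: strong retrieval cannot compensate for weak verifiability.
print(clears_threshold(SynthesisScores(0.42, 0.55, 0.12)))  # False
```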
Innovation

Methods, ideas, or system contributions that make the work stand out.

Live benchmark using recent ArXiv papers
Automated evaluation framework for synthesis tasks
Reference pipeline (DeepScholar-base) implemented via the LOTUS API (sketched below)
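DeepScholar-base is built with LOTUS, which exposes LLM-powered semantic operators over pandas DataFrames. The sketch below only illustrates that operator style (sem_filter to prune candidates, sem_topk to rank them, sem_agg to synthesize a summary); the model name, prompts, and toy data are placeholders, and the real pipeline lives in the linked repository.

```python
import pandas as pd
import lotus
from lotus.models import LM

# Configure LOTUS with a language model (model name is a placeholder).
lotus.settings.configure(lm=LM(model="gpt-4o-mini"))

# Toy stand-in for sources retrieved from the live web.
df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": [
        "Retrieval-augmented generation for long-form question answering...",
        "A convolutional architecture for medical image segmentation...",
        "Evaluating citation faithfulness in LLM-generated survey text...",
    ],
})

# Prune candidates with a natural-language predicate over the {abstract} column.
relevant = df.sem_filter("{abstract} is relevant to evaluating generative research synthesis")

# Rank the survivors and keep the most relevant ones.
top = relevant.sem_topk("Which {abstract} is most relevant to research-synthesis evaluation?", K=2)

# Aggregate the top sources into a short related-work-style paragraph.
summary = top.sem_agg("Summarize the {abstract}s into a short related-work paragraph")
print(summary._output[0])
```

A pipeline in this style keeps filtering, ranking, and synthesis as composable DataFrame operations, which is what makes a compact, efficient baseline plausible.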