🤖 AI Summary
This work addresses the challenge of non-determinism in existing real-time API–based evaluation frameworks for deep research workflows, which undermines reproducibility and cross-system comparison. We propose the first deterministic simulation benchmark tailored for academic literature exploration, decoupling the workflow into three stages: query planning, tool invocation, and relevance assessment. Leveraging a static corpus of 570,000 papers and 2,536 expert-annotated queries, we conduct end-to-end experiments with multiple large language models. Our results reveal significant differences among models in reasoning capabilities, planning strategies, and selection mechanisms, which critically influence multi-turn iterative performance. This framework establishes a reproducible, fine-grained foundation for evaluating and optimizing deep research workflows, offering key insights into the design of effective agent-based scholarly search systems.
📝 Abstract
Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
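The three decoupled stages described above can be sketched as a toy episode loop. Everything here is illustrative, not ScholarGym's actual API: the four-entry corpus stands in for the static 570K-paper index, and the function names (`plan_query`, `invoke_tool`, `assess`, `run_episode`) are hypothetical. The key property the sketch demonstrates is determinism: retrieval scores a fixed corpus with stable tie-breaking, so identical inputs always yield identical results across runs.

```python
# Toy sketch of a deterministic three-stage research loop (assumed design,
# not ScholarGym's real interface).

# Hypothetical static corpus standing in for the 570K-paper index.
CORPUS = {
    1: "graph neural networks for molecule property prediction",
    2: "retrieval augmented generation for question answering",
    3: "deterministic evaluation of tool augmented language models",
    4: "iterative query refinement in academic literature search",
}

def plan_query(info_need: str, found: set) -> str:
    """Stage 1 (query planning): a trivial passthrough here; in the real
    workflow an LLM would rewrite the query based on results found so far."""
    return info_need

def invoke_tool(query: str, k: int = 2) -> list:
    """Stage 2 (tool invocation): deterministic retrieval over the static
    corpus. Term-overlap scoring with doc-id tie-breaking guarantees the
    same ranking on every run, eliminating live-API variance."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: (-len(terms & set(CORPUS[doc_id].split())), doc_id),
    )
    return ranked[:k]

def assess(doc_id: int, gold: set) -> bool:
    """Stage 3 (relevance assessment): judged here against annotated ground
    truth; in the live workflow an LLM performs this judgment."""
    return doc_id in gold

def run_episode(info_need: str, gold: set, max_turns: int = 3) -> list:
    """One multi-turn episode: plan, retrieve, assess, repeat."""
    found = set()
    for _ in range(max_turns):
        query = plan_query(info_need, found)
        for doc_id in invoke_tool(query):
            if assess(doc_id, gold):
                found.add(doc_id)
    return sorted(found)
```

Because each stage is a separate function, a harness can swap in a real LLM for any one stage while holding the other two fixed, which is what enables the fine-grained, per-stage analysis the abstract describes.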