🤖 AI Summary
This work addresses the challenge of non-determinism in existing real-time API–based evaluation frameworks for deep research workflows, which undermines reproducibility and cross-system comparison. We propose the first deterministic simulation benchmark tailored for academic literature exploration, decoupling the workflow into three stages: query planning, tool invocation, and relevance assessment. Leveraging a static corpus of 570,000 papers and 2,536 expert-annotated queries, we conduct end-to-end experiments with multiple large language models. Our results reveal significant differences among models in reasoning capabilities, planning strategies, and selection mechanisms, which critically influence multi-turn iterative performance. This framework establishes a reproducible, fine-grained foundation for evaluating and optimizing deep research workflows, offering key insights into the design of effective agent-based scholarly search systems.
📝 Abstract
Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.
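The three decoupled stages described above can be sketched as a toy episode loop. Everything here is illustrative, not ScholarGym's actual API: the four-entry corpus stands in for the static 570K-paper index, and the function names (`plan_query`, `invoke_tool`, `assess`, `run_episode`) are hypothetical. The key property the sketch demonstrates is determinism: retrieval scores a fixed corpus with stable tie-breaking, so identical inputs always yield identical results across runs.

```python
# Toy sketch of a deterministic three-stage research loop (assumed design,
# not ScholarGym's real interface).

# Hypothetical static corpus standing in for the 570K-paper index.
CORPUS = {
    1: "graph neural networks for molecule property prediction",
    2: "retrieval augmented generation for question answering",
    3: "deterministic evaluation of tool augmented language models",
    4: "iterative query refinement in academic literature search",
}

def plan_query(info_need: str, found: set) -> str:
    """Stage 1 (query planning): a trivial passthrough here; in the real
    workflow an LLM would rewrite the query based on results found so far."""
    return info_need

def invoke_tool(query: str, k: int = 2) -> list:
    """Stage 2 (tool invocation): deterministic retrieval over the static
    corpus. Term-overlap scoring with doc-id tie-breaking guarantees the
    same ranking on every run, eliminating live-API variance."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc_id: (-len(terms & set(CORPUS[doc_id].split())), doc_id),
    )
    return ranked[:k]

def assess(doc_id: int, gold: set) -> bool:
    """Stage 3 (relevance assessment): judged here against annotated ground
    truth; in the live workflow an LLM performs this judgment."""
    return doc_id in gold

def run_episode(info_need: str, gold: set, max_turns: int = 3) -> list:
    """One multi-turn episode: plan, retrieve, assess, repeat."""
    found = set()
    for _ in range(max_turns):
        query = plan_query(info_need, found)
        for doc_id in invoke_tool(query):
            if assess(doc_id, gold):
                found.add(doc_id)
    return sorted(found)
```

Because each stage is a separate function, a harness can swap in a real LLM for any one stage while holding the other two fixed, which is what enables the fine-grained, per-stage analysis the abstract describes.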