🤖 AI Summary
Existing benchmarks overemphasize retrieval while neglecting the high-level planning and cross-disciplinary reasoning capabilities essential for scientific deep-research (DR) agents. Method: We introduce Dr.Mi-Bench, a modular-integrated benchmark tailored for scientific DR agents, comprising 200 expert-annotated instances across 10 scientific domains, together with Dr.Mi-Eval, an evaluation paradigm supporting both end-to-end and isolated evaluation modes. Grounded in the structure of academic papers, this capability-decoupling framework separately assesses planning, retrieval, and reasoning. Contribution/Results: Experiments reveal systematic deficiencies of current agents on multi-source retrieval and cross-disciplinary tasks, and confirm high-level planning as a critical bottleneck constraining the reasoning potential of foundational LLMs. Dr.Mi-Bench provides interpretable diagnostic pathways, establishing a methodological foundation for the targeted optimization of DR agents.
📝 Abstract
The explosive growth of the academic literature necessitates automated deep-research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, they favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in the academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, drawn from both research and review papers. We also propose the Modular-integrated Evaluation Paradigm for DR agents (Dr.Mi-Eval), which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning is the crucial factor in unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
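
To make the two evaluation modes concrete, here is a minimal sketch of how such a modular-integrated harness could be wired up. All names (`Instance`, `end_to_end`, `isolated_reasoning`) and the placeholder scorers are hypothetical illustrations, not the paper's actual API; the point is only that end-to-end mode scores an agent's own plan, sources, and synthesis against gold annotations derived from a paper's structure, while isolated mode supplies the gold plan and sources so that only the backbone LLM's reasoning is measured.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data model: each benchmark instance pairs a research
# question with gold targets derived from an annotated paper's structure.
@dataclass
class Instance:
    question: str
    gold_plan: list[str]       # paper outline      -> planning target
    gold_sources: list[str]    # cited references   -> retrieval target
    gold_synthesis: str        # conclusions        -> reasoning target

def overlap(pred: set[str], gold: set[str]) -> float:
    """Placeholder scorer: recall of gold items among the predictions."""
    return len(pred & gold) / len(gold) if gold else 0.0

def end_to_end(agent: Callable[[str], dict], inst: Instance) -> dict:
    """End-to-end mode: a full DR agent receives only the question and
    must plan, retrieve, and reason itself; each stage output is scored
    against the corresponding gold component."""
    out = agent(inst.question)  # expected keys: plan, sources, synthesis
    return {
        "planning":  overlap(set(out["plan"]), set(inst.gold_plan)),
        "retrieval": overlap(set(out["sources"]), set(inst.gold_sources)),
        "reasoning": float(inst.gold_synthesis in out["synthesis"]),
    }

def isolated_reasoning(llm: Callable[[str], str], inst: Instance) -> float:
    """Isolated mode: the gold plan and sources are provided up front, so
    the score reflects only the backbone LLM's reasoning over them."""
    prompt = (f"Question: {inst.question}\n"
              f"Plan: {inst.gold_plan}\n"
              f"Sources: {inst.gold_sources}")
    return float(inst.gold_synthesis in llm(prompt))
```

The substring-containment check stands in for whatever report-quality metric the benchmark actually uses; the structural split is what matters: comparing end-to-end against isolated scores is what lets planning, rather than retrieval or reasoning, be identified as the bottleneck.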