🤖 AI Summary
Existing benchmarks overemphasize retrieval while neglecting the high-level planning and cross-disciplinary reasoning capabilities essential for scientific deep-research (DR) agents. Method: We introduce Dr.Mi-Bench, a modular-integrated benchmark tailored for scientific DR agents, comprising 200 expert-annotated instances across 10 scientific domains, together with Dr.Mi-Eval, an evaluation paradigm supporting both end-to-end and isolated evaluation modes. Grounded in the structure of academic papers, this capability-decoupling framework separately assesses planning, retrieval, and reasoning. Contribution/Results: Experiments reveal systematic deficiencies of current agents on multi-source retrieval and cross-disciplinary tasks, and confirm high-level planning as a critical bottleneck constraining the reasoning potential of foundational LLMs. Dr.Mi-Bench provides interpretable diagnostic pathways, establishing a methodological foundation for the targeted optimization of DR agents.
📝 Abstract
The explosive growth of the academic literature necessitates automated deep-research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, they favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in the academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, drawn from both research and review papers. We also propose the Modular-integrated Evaluation Paradigm for DR agents (Dr.Mi-Eval), which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning is the crucial factor in unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
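
To make the two evaluation modes concrete, here is a minimal sketch of how such a modular-integrated harness could be wired up. All names (`Instance`, `end_to_end`, `isolated_reasoning`) and the placeholder scorers are hypothetical illustrations, not the paper's actual API; the point is only that end-to-end mode scores an agent's own plan, sources, and synthesis against gold annotations derived from a paper's structure, while isolated mode supplies the gold plan and sources so that only the backbone LLM's reasoning is measured.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data model: each benchmark instance pairs a research
# question with gold targets derived from an annotated paper's structure.
@dataclass
class Instance:
    question: str
    gold_plan: list[str]       # paper outline      -> planning target
    gold_sources: list[str]    # cited references   -> retrieval target
    gold_synthesis: str        # conclusions        -> reasoning target

def overlap(pred: set[str], gold: set[str]) -> float:
    """Placeholder scorer: recall of gold items among the predictions."""
    return len(pred & gold) / len(gold) if gold else 0.0

def end_to_end(agent: Callable[[str], dict], inst: Instance) -> dict:
    """End-to-end mode: a full DR agent receives only the question and
    must plan, retrieve, and reason itself; each stage output is scored
    against the corresponding gold component."""
    out = agent(inst.question)  # expected keys: plan, sources, synthesis
    return {
        "planning":  overlap(set(out["plan"]), set(inst.gold_plan)),
        "retrieval": overlap(set(out["sources"]), set(inst.gold_sources)),
        "reasoning": float(inst.gold_synthesis in out["synthesis"]),
    }

def isolated_reasoning(llm: Callable[[str], str], inst: Instance) -> float:
    """Isolated mode: the gold plan and sources are provided up front, so
    the score reflects only the backbone LLM's reasoning over them."""
    prompt = (f"Question: {inst.question}\n"
              f"Plan: {inst.gold_plan}\n"
              f"Sources: {inst.gold_sources}")
    return float(inst.gold_synthesis in llm(prompt))
```

The substring-containment check stands in for whatever report-quality metric the benchmark actually uses; the structural split is what matters: comparing end-to-end against isolated scores is what lets planning, rather than retrieval or reasoning, be identified as the bottleneck.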