🤖 AI Summary
Motivation: Existing benchmarks inadequately assess deep research agents' capabilities in frontier problem discovery and multi-stage scientific inquiry, including literature review, methodology design, and empirical validation.
Method: We introduce DeepResearch Arena, the first evaluation benchmark grounded in real academic seminars. It leverages expert discussion transcripts and a Multi-Agent Hierarchical Task Generation (MAHTG) system to automatically extract research-worthy inspirations and construct high-quality, traceable, low-leakage scientific tasks.
Contribution/Results: The benchmark comprises over 10,000 tasks drawn from more than 200 academic seminars across 12 disciplines. Extensive experiments show that current state-of-the-art agents fall substantially short on these tasks, with clear performance gaps across models, underscoring the benchmark's rigor, realism, and diagnostic value for evaluating advanced scientific reasoning and autonomous research capabilities.
📝 Abstract
Deep research agents have attracted growing attention for their potential to orchestrate multi-stage research workflows, spanning literature synthesis, methodological design, and empirical verification. Despite these strides, faithfully evaluating their research capability is challenging because it is hard to collect frontier research questions that genuinely capture researchers' attention and intellectual curiosity. To address this gap, we introduce DeepResearch Arena, a benchmark grounded in academic seminars, which capture rich expert discourse and interaction, better reflect real-world research environments, and reduce the risk of data leakage. To construct DeepResearch Arena automatically, we propose a Multi-Agent Hierarchical Task Generation (MAHTG) system that extracts research-worthy inspirations from seminar transcripts and translates them into high-quality research tasks, ensuring that task formulation remains traceable while filtering out noise. Using MAHTG, we curate over 10,000 high-quality research tasks from more than 200 academic seminars, spanning 12 disciplines such as literature, history, and science. Our extensive evaluation shows that DeepResearch Arena presents substantial challenges for current state-of-the-art agents, with clear performance gaps across different models.
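The abstract does not detail the agents inside MAHTG, but the pipeline it describes (extract inspirations from transcripts, filter noise, generate traceable tasks) can be sketched roughly as below. This is a minimal illustrative sketch assuming a generic `llm(prompt) -> str` completion callable; the class names, prompts, and the three-stage agent split are all assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative MAHTG-style pipeline sketch. Assumes `llm` is any function that
# takes a prompt string and returns a completion string; all names and prompts
# here are hypothetical.
from dataclasses import dataclass


@dataclass
class Inspiration:
    text: str           # candidate research-worthy idea
    source_span: str    # transcript excerpt it came from (traceability)


@dataclass
class ResearchTask:
    prompt: str                 # the generated research task
    inspiration: Inspiration    # back-pointer to the source for traceability


def extract_inspirations(transcript: str, llm) -> list[Inspiration]:
    """Stage 1: mine candidate research-worthy ideas from a seminar transcript."""
    ideas = llm(f"List research-worthy ideas raised in this transcript:\n{transcript}")
    return [Inspiration(text=line.strip(), source_span=transcript)
            for line in ideas.splitlines() if line.strip()]


def is_noise(insp: Inspiration, llm) -> bool:
    """Stage 2: filter out chit-chat, logistics, and off-topic remarks."""
    verdict = llm(f"Is this a substantive research idea? Answer yes or no:\n{insp.text}")
    return not verdict.lower().startswith("yes")


def generate_task(insp: Inspiration, llm) -> ResearchTask:
    """Stage 3: translate an inspiration into a concrete multi-stage research task."""
    task = llm("Write a research task (literature review, method design, "
               f"empirical validation) based on this idea:\n{insp.text}")
    return ResearchTask(prompt=task, inspiration=insp)


def build_benchmark(transcripts: list[str], llm) -> list[ResearchTask]:
    """Run the full pipeline over a collection of seminar transcripts."""
    tasks = []
    for transcript in transcripts:
        for insp in extract_inspirations(transcript, llm):
            if not is_noise(insp, llm):
                tasks.append(generate_task(insp, llm))
    return tasks
```

Keeping a back-pointer from each generated task to its source excerpt is one plausible way to realize the traceability the abstract emphasizes, since every task can then be audited against the seminar discussion that inspired it.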