DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

📅 2026-03-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity of large-scale datasets and open-source training frameworks that reflect real-world complexity for deep-research agents. To this end, we introduce DeepResearch-9K, a dataset of 9,000 research-oriented questions spanning multiple difficulty levels, along with DeepResearch-R1, an open-source training framework supporting multi-turn web interaction and reinforcement learning. We propose a cost-effective method to automatically generate high-difficulty search trajectories from multi-hop question-answering datasets, augmented with reasoning chains produced by the Tongyi-DeepResearch-30B-A3B model. Leveraging LLM-as-judge feedback and reinforcement learning, our approach trains agents that achieve state-of-the-art performance across multiple deep-research benchmarks. Both the dataset and code are publicly released.

📝 Abstract
Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite these powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset designed specifically for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9,000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop DeepResearch-R1, an open-source training framework that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models, such as rule-based outcome rewards and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 framework achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset at https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 at https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
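The rule-based outcome reward mentioned in the abstract can be sketched minimally. The normalization scheme and exact-match rule below are illustrative assumptions about how verifiable answers might be scored, not the paper's actual implementation:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def outcome_reward(prediction: str, gold: str) -> float:
    """Rule-based outcome reward: 1.0 on normalized exact match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

In practice such a rule-based reward would be combined with, or replaced by, LLM-as-judge feedback for answers that cannot be verified by string matching.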
Problem

Research questions and friction points this paper is trying to address.

deep-research agents
benchmark dataset
data synthesis
agent training
multi-hop QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep-research agent
benchmark dataset
multi-hop QA
reinforcement learning framework
LLM-as-judge
🔎 Similar Papers
2023-08-22 · Frontiers Comput. Sci. · Citations: 866