DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

📅 2026-03-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the scarcity of large-scale datasets and open-source training frameworks that reflect real-world complexity for deep-research agents. To this end, we introduce DeepResearch-9K, a dataset of 9,000 research-oriented questions spanning multiple difficulty levels, along with DeepResearch-R1, an open-source training framework supporting multi-turn web interaction and reinforcement learning. We propose a cost-effective method to automatically generate high-difficulty search trajectories from multi-hop question-answering datasets, augmented with reasoning chains produced by the Tongyi-DeepResearch-30B-A3B model. Leveraging LLM-as-judge feedback and reinforcement learning, our approach trains agents that achieve state-of-the-art performance across multiple deep-research benchmarks. Both the dataset and code are publicly released.

📝 Abstract
Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite these powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset designed specifically for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9,000 questions spanning three difficulty levels from L1 to L3, (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop DeepResearch-R1, an open-source training framework that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models, such as rule-based outcome rewards and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 framework achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset at https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 at https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
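The rule-based outcome reward mentioned in the abstract can be sketched minimally. The normalization scheme and exact-match rule below are illustrative assumptions about how verifiable answers might be scored, not the paper's actual implementation:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def outcome_reward(prediction: str, gold: str) -> float:
    """Rule-based outcome reward: 1.0 on normalized exact match, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

In practice such a rule-based reward would be combined with, or replaced by, LLM-as-judge feedback for answers that cannot be verified by string matching.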
Problem

Research questions and friction points this paper is trying to address.

deep-research agents
benchmark dataset
data synthesis
agent training
multi-hop QA
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep-research agent
benchmark dataset
multi-hop QA
reinforcement learning framework
LLM-as-judge
🔎 Similar Papers
2023-08-22 · Frontiers Comput. Sci. · Citations: 866