Step-DeepResearch Technical Report

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing academic benchmarks (e.g., BrowseComp) inadequately support open-ended deep research, exhibiting critical deficiencies in intent recognition, long-horizon reasoning, and cross-source verification. To address these limitations, we propose Step-DeepResearch, an end-to-end agent framework for deep research. Our method introduces (1) an atomic-capability-based data synthesis approach; (2) a progressive training paradigm comprising agentic mid-training, supervised fine-tuning, and reinforcement learning; and (3) a checklist-style automated judger. We further release ADR-Bench, a Chinese-language benchmark designed to evaluate realistic deep research scenarios. Experimental results show that our 32B-parameter model scores 61.4% on the Scale AI Research Rubrics and significantly outperforms open-source models of comparable scale on ADR-Bench, rivaling state-of-the-art closed-source systems such as OpenAI and Gemini Deep Research.

📝 Abstract
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal capability. However, existing academic benchmarks such as BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on the Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms models of comparable scale and rivals SOTA closed-source systems such as OpenAI and Gemini DeepResearch. These findings demonstrate that refined training enables medium-sized models to reach expert-level capability at industry-leading cost-efficiency.
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks fail to meet real-world open-ended research demands.
There is an evaluation gap for deep research in the Chinese domain.
Current models lack robust skills for complex research tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Synthesis Strategy Based on Atomic Capabilities
Progressive training path from agentic mid-training to SFT and RL
Checklist-style Judger for enhanced robustness
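
To make the checklist-style judging idea concrete, here is a minimal sketch in Python. This is a hypothetical illustration, not the paper's implementation: the `ChecklistItem` structure, the rubric items, and the string predicates standing in for LLM judgments are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    """One atomic, binary-checkable requirement for a research report."""
    description: str
    check: Callable[[str], bool]  # True if the report satisfies the item

def judge_report(report: str, checklist: List[ChecklistItem]) -> float:
    """Score a report as the fraction of checklist items it satisfies.

    A real judger would ask an LLM to decide each item; here simple
    string predicates stand in for those per-item judgments.
    """
    passed = sum(1 for item in checklist if item.check(report))
    return passed / len(checklist)

# Hypothetical rubric for a market-research prompt.
checklist = [
    ChecklistItem("Cites at least one source", lambda r: "http" in r),
    ChecklistItem("Mentions 2024 figures", lambda r: "2024" in r),
    ChecklistItem("Has a conclusion section", lambda r: "Conclusion" in r),
]

report = "Revenue grew in 2024 (see http://example.com).\n\nConclusion: ..."
score = judge_report(report, checklist)  # all three items pass
```

Decomposing the rubric into atomic, independently checkable items is what makes the reward signal usable for RL: each item is a near-binary judgment rather than one holistic score.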
Chen Hu
School of Artificial Intelligence and Computer Science, Jiangnan University (Geometric Deep Learning, Machine Learning)
Haikuo Du
StepFun
Heng Wang
StepFun
Lin Lin
StepFun
Mingrui Chen
Institute of Automation, Chinese Academy of Sciences (Computer Vision, Foundation Models)
Peng Liu
StepFun
Ruihang Miao
StepFun
Tianchi Yue
StepFun
Wang You
StepFun
Wei Ji
StepFun
Wei Yuan
StepFun
Wenjin Deng
StepFun
Xiaojian Yuan
StepFun
Xiaoyun Zhang
StepFun
Xiangyu Liu
StepFun
Xikai Liu
StepFun
Yanming Xu
StepFun
Yicheng Cao
StepFun
Yifei Zhang
StepFun
Yongyao Wang
StepFun
Yubo Shu
StepFun
Yurong Zhang
StepFun
Yuxiang Zhang
StepFun
Zheng Gong
StepFun
Zhichao Chang
StepFun