🤖 AI Summary
Existing academic benchmarks such as BrowseComp fall short of open-ended deep research, which demands intent recognition, long-horizon reasoning, and cross-source verification. To address these gaps, we propose Step-DeepResearch, an end-to-end agent framework for deep research. Our method introduces (1) a data synthesis approach based on atomic capabilities; (2) a progressive training paradigm spanning agentic mid-training, supervised fine-tuning, and reinforcement learning; and (3) a checklist-style automated judger. We further release ADR-Bench, the first Chinese-language benchmark designed specifically to evaluate deep research capabilities. Experiments show that our 32B-parameter model scores 61.4% on Scale AI's Research Rubrics and significantly outperforms open-source models of comparable scale on ADR-Bench, matching state-of-the-art closed-source systems such as OpenAI and Gemini Deep Research.
📝 Abstract
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal capability. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end deep research agent. We propose a data synthesis strategy based on atomic capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a checklist-style judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI's Research Rubrics. On ADR-Bench, it significantly outperforms models of comparable scale and rivals SOTA closed-source systems such as OpenAI and Gemini Deep Research. These findings show that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
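To make the checklist-style judger concrete, here is a minimal sketch of the idea: decompose a rubric into binary checklist items and score a report as the fraction of items satisfied. The item texts and the keyword-based check below are illustrative stand-ins (the paper's actual judger and rubrics are not specified here); a real implementation would query an LLM judge per item.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    description: str      # what the report must contain
    keywords: list[str]   # stand-in criterion for an LLM judge

def satisfied(item: ChecklistItem, report: str) -> bool:
    # Placeholder check: keyword match instead of an LLM judgment.
    text = report.lower()
    return all(k.lower() in text for k in item.keywords)

def checklist_score(items: list[ChecklistItem], report: str) -> float:
    # Score = satisfied items / total items, in [0, 1].
    hits = sum(satisfied(item, report) for item in items)
    return hits / len(items)

# Hypothetical rubric and report for illustration only.
items = [
    ChecklistItem("cites at least one source", ["source"]),
    ChecklistItem("states a conclusion", ["conclusion"]),
]
report = "Based on one source, the conclusion is positive."
print(round(checklist_score(items, report), 2))  # → 1.0
```

Checklist-style scoring gives a per-item, interpretable signal, which is what makes it usable both as an evaluator and as a reward for RL.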