How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

📅 2026-02-23
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the unclear mechanistic role of reinforcement learning (RL) in current Deep Research agents, in particular the lack of systematic analysis of the interplay among prompt templates, reward functions, and policy optimization. To this end, the work disentangles these three dimensions and proposes targeted improvements: it identifies the “Fast Thinking” prompt template as superior to the “Slow Thinking” template used in prior work, designs an F1-based reward mechanism that incorporates action-level penalties, and shows that REINFORCE trains more efficiently and stably than alternatives such as PPO and GRPO. Building on a multi-turn retrieval-decision framework, the authors combine these components into an enhanced baseline, Search-R1++, which improves Search-R1 performance from 0.403 to 0.442 on Qwen2.5-7B and from 0.289 to 0.331 on Qwen2.5-3B.
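To make the reward design concrete, below is a minimal Python sketch of an F1-based outcome reward with an action-level penalty, in the spirit of the summary. The token-level F1 definition, the penalty coefficient, and the free-action budget are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: F1 outcome reward with an action-level penalty.
# The penalty discourages the answer-avoidance behavior that the paper
# reports as the cause of training collapse under a plain F1 reward.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def shaped_reward(prediction: str, gold: str, num_actions: int,
                  penalty: float = 0.1, free_actions: int = 4) -> float:
    """F1 reward minus a penalty for each action beyond a small budget.
    `penalty` and `free_actions` are assumed values for illustration."""
    return token_f1(prediction, gold) - penalty * max(0, num_actions - free_actions)
```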

📝 Abstract
Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM-based reward due to training collapse driven by answer avoidance; this collapse can be mitigated by incorporating action-level penalties, after which the F1-based reward ultimately surpasses EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among the policy optimization methods. Building on these insights, we introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings pave the way for more principled and reliable RL training strategies in Deep Research systems.
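For reference, here is a minimal sketch of a REINFORCE-style loss of the kind the abstract favors over PPO and GRPO. The batch-mean baseline and the action mask (which excludes retrieved passages from the loss, as in Search-R1-style training) are assumptions about a typical setup, not the authors' exact implementation.

```python
# Minimal REINFORCE sketch for multi-turn agent rollouts (assumed setup).
import torch

def reinforce_loss(logprobs: torch.Tensor,     # (batch, seq_len): log pi(a_t | s_t)
                   action_mask: torch.Tensor,  # (batch, seq_len): 1 for model tokens,
                                               # 0 for retrieved/tool tokens
                   rewards: torch.Tensor       # (batch,): scalar outcome reward
                   ) -> torch.Tensor:
    # A batch-mean baseline reduces variance without a learned critic,
    # one reason REINFORCE can be cheaper and more stable than PPO here.
    advantage = rewards - rewards.mean()
    # Sum log-probs over the model's own tokens only (retrieved text masked out).
    seq_logprob = (logprobs * action_mask).sum(dim=-1)
    # Policy gradient: maximize E[A * log pi], so minimize the negative.
    return -(advantage.detach() * seq_logprob).mean()
```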
Problem

Research questions and friction points this paper is trying to address.

Deep Research
Reinforcement Learning
Prompt Template
Reward Function
Policy Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Research Agent
Reinforcement Learning
Prompt Template
Reward Function
Policy Optimization
👥 Authors
Yinuo Xu
NLPR & MAIS, CASIA; School of AI, UCAS
Shuo Lu
NLPR & MAIS, CASIA
Jianjie Cheng
Meituan Inc.
Meng Wang
Meituan Inc.
Qianlong Xie
Meituan Inc.
Xingxing Wang
Meituan Inc.
Ran He
NLPR & MAIS, CASIA
Jian Liang
Kuaishou Inc.