SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing intelligent agents in tool-augmented web retrieval, where insufficient exploration—such as premature termination or biased tool usage—hinders effective reinforcement learning. To overcome this, we propose SynPlanResearch-R1, a novel framework that, for the first time, integrates synthetically generated exploratory tool-use trajectories into the training pipeline. During the cold-start phase, supervised fine-tuning guides deep exploration, yielding high-quality initialization for subsequent reinforcement learning with verifiable rewards (RLVR). By unifying synthetic planning, supervised fine-tuning, and RLVR, our approach enables end-to-end training. Evaluated across seven benchmarks, SynPlanResearch-R1 achieves performance gains of up to 6.0% and 5.8% with Qwen3-8B and Qwen3-4B, respectively, significantly outperforming state-of-the-art methods and markedly enhancing reasoning capabilities in multi-hop and open-web tasks.

📝 Abstract
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration, shaping exploration behavior during cold-start supervised fine-tuning and providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance over SOTA baselines by up to 6.0% on the Qwen3-8B backbone and 5.8% on Qwen3-4B. Further analyses of tool-use patterns and training dynamics relative to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.
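The abstract describes a pipeline of synthesizing exploratory tool-use trajectories, filtering them with a verifiable reward, and using the survivors as cold-start SFT data before RL. A minimal toy sketch of that flow, under stated assumptions: all names here (`Trajectory`, `synthesize_trajectory`, `verify`, `build_cold_start_set`) are hypothetical illustrations, not the paper's actual code, and the gold answer stands in for what a planner LLM would generate.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Interleaved tool calls recorded as (tool_name, query) pairs.
    steps: list = field(default_factory=list)
    answer: str = ""

def synthesize_trajectory(question, gold, min_tool_calls=3, seed=0):
    """Synthesize a trajectory forced to make several varied tool calls
    before answering, countering the 'premature termination' and
    'biased tool usage' failure modes the paper identifies."""
    rng = random.Random(seed)
    traj = Trajectory()
    for i in range(min_tool_calls):
        tool = rng.choice(["search", "browse"])  # vary tools to avoid bias
        traj.steps.append((tool, f"subquery-{i}: {question}"))
    traj.answer = gold  # stand-in for the planner's generated answer
    return traj

def verify(traj, gold):
    """Binary verifiable reward: exact match on the final answer."""
    return 1.0 if traj.answer.strip() == gold.strip() else 0.0

def build_cold_start_set(qa_pairs, min_tool_calls=3):
    """Keep only reward-verified exploratory trajectories as SFT data."""
    data = []
    for q, gold in qa_pairs:
        traj = synthesize_trajectory(q, gold, min_tool_calls)
        if verify(traj, gold) == 1.0:
            data.append((q, traj))
    return data
```

In the real framework the SFT stage trains the backbone on these trajectories and RLVR then continues from that initialization; this sketch only illustrates the data-construction and reward-filtering logic.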
Problem

Research questions and friction points this paper is trying to address.

tool exploration
research agents
reinforcement learning
premature termination
biased tool usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Plans
Tool Exploration
Research Agents
Supervised Fine-tuning
Reinforcement Learning with Verifiable Rewards