SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing intelligent agents in tool-augmented web retrieval, where insufficient exploration—such as premature termination or biased tool usage—hinders effective reinforcement learning. To overcome this, we propose SynPlanResearch-R1, a novel framework that, for the first time, integrates synthetically generated exploratory tool-use trajectories into the training pipeline. During the cold-start phase, supervised fine-tuning guides deep exploration, yielding high-quality initialization for subsequent reinforcement learning with verifiable rewards (RLVR). By unifying synthetic planning, supervised fine-tuning, and RLVR, our approach enables end-to-end training. Evaluated across seven benchmarks, SynPlanResearch-R1 achieves performance gains of up to 6.0% and 5.8% with Qwen3-8B and Qwen3-4B, respectively, significantly outperforming state-of-the-art methods and markedly enhancing reasoning capabilities in multi-hop and open-web tasks.

📝 Abstract
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration, shaping exploration behavior during cold-start supervised fine-tuning and providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance over SOTA baselines by up to 6.0% on the Qwen3-8B backbone and 5.8% on Qwen3-4B. Further analyses of tool-use patterns and training dynamics relative to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.
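The abstract describes a pipeline of synthesizing exploratory tool-use trajectories, filtering them with a verifiable reward, and using the survivors as cold-start SFT data before RL. A minimal toy sketch of that flow, under stated assumptions: all names here (`Trajectory`, `synthesize_trajectory`, `verify`, `build_cold_start_set`) are hypothetical illustrations, not the paper's actual code, and the gold answer stands in for what a planner LLM would generate.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Interleaved tool calls recorded as (tool_name, query) pairs.
    steps: list = field(default_factory=list)
    answer: str = ""

def synthesize_trajectory(question, gold, min_tool_calls=3, seed=0):
    """Synthesize a trajectory forced to make several varied tool calls
    before answering, countering the 'premature termination' and
    'biased tool usage' failure modes the paper identifies."""
    rng = random.Random(seed)
    traj = Trajectory()
    for i in range(min_tool_calls):
        tool = rng.choice(["search", "browse"])  # vary tools to avoid bias
        traj.steps.append((tool, f"subquery-{i}: {question}"))
    traj.answer = gold  # stand-in for the planner's generated answer
    return traj

def verify(traj, gold):
    """Binary verifiable reward: exact match on the final answer."""
    return 1.0 if traj.answer.strip() == gold.strip() else 0.0

def build_cold_start_set(qa_pairs, min_tool_calls=3):
    """Keep only reward-verified exploratory trajectories as SFT data."""
    data = []
    for q, gold in qa_pairs:
        traj = synthesize_trajectory(q, gold, min_tool_calls)
        if verify(traj, gold) == 1.0:
            data.append((q, traj))
    return data
```

In the real framework the SFT stage trains the backbone on these trajectories and RLVR then continues from that initialization; this sketch only illustrates the data-construction and reward-filtering logic.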
Problem

Research questions and friction points this paper is trying to address.

tool exploration
research agents
reinforcement learning
premature termination
biased tool usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic Plans
Tool Exploration
Research Agents
Supervised Fine-tuning
Reinforcement Learning with Verifiable Rewards