DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

📅 2025-10-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language model (LLM)-based research agents under-optimize the planning phase of long-horizon tasks; under vanilla reinforcement learning (RL), planning tokens exhibit markedly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. Method: DeepPlanner, an end-to-end RL framework for planning optimization that (i) shapes token-level advantages with an entropy-based term to allocate larger updates to high-entropy tokens, and (ii) selectively upweights sample-level advantages for planning-intensive rollouts. Contribution/Results: Across seven deep research benchmarks, DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
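The entropy-aware token-level advantage shaping described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the normalization scheme, and the coefficient `alpha` are assumptions chosen to show the idea of allocating larger updates to high-entropy (uncertain) planning tokens.

```python
import numpy as np

def shape_token_advantages(advantages, entropies, alpha=0.2):
    """Entropy-aware token-level advantage shaping (illustrative sketch).

    Scales each token's advantage by a bonus proportional to its
    entropy, normalized within the rollout, so that high-entropy
    decision tokens receive larger policy-gradient updates.
    """
    adv = np.asarray(advantages, dtype=float)
    ent = np.asarray(entropies, dtype=float)
    # Normalize entropy to [0, 1] within the rollout to keep the
    # shaping term bounded (eps avoids division by zero).
    norm = (ent - ent.min()) / (ent.max() - ent.min() + 1e-8)
    # Highest-entropy token gets up to a (1 + alpha)x scaled advantage.
    return adv * (1.0 + alpha * norm)
```

With `alpha=0.2`, the lowest-entropy token's advantage is left unchanged while the highest-entropy token's advantage is scaled by roughly 1.2x, concentrating updates on uncertain planning positions.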

📝 Abstract
Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
Problem

Research questions and friction points this paper is trying to address.

Under-optimized, high-entropy planning tokens under vanilla RL
Enhancing the planning capabilities of deep research agents
Improving planning quality under a substantially lower training budget
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shapes token-level advantages with an entropy-based term
Selectively upweights sample-level advantages for planning-intensive rollouts
Enhances planning capabilities via an end-to-end RL framework
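The second innovation, sample-level advantage upweighting, can be sketched in the same spirit. The function name, the planning-token ratio as the intensity measure, and the `threshold`/`boost` constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def upweight_planning_rollouts(sample_advantages, planning_ratios,
                               threshold=0.3, boost=1.5):
    """Sample-level advantage upweighting (illustrative sketch).

    Rollouts whose fraction of planning tokens exceeds `threshold`
    have their advantages scaled by `boost`, steering optimization
    toward planning-intensive trajectories.
    """
    adv = np.asarray(sample_advantages, dtype=float)
    ratios = np.asarray(planning_ratios, dtype=float)
    # Per-rollout weight: boost planning-heavy samples, leave others as-is.
    weights = np.where(ratios > threshold, boost, 1.0)
    return adv * weights
```

Together with the token-level shaping, this biases the policy update both within a rollout (toward uncertain tokens) and across rollouts (toward planning-heavy samples).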