🤖 AI Summary
This work addresses the trade-off between the high computational overhead of large language models (LLMs) and the weak performance of small models in agentic reinforcement learning. We propose a process-oriented reward-shaping method that enables efficient policy optimization on the TravelPlanner benchmark—without auxiliary mechanisms such as curriculum learning. By providing fine-grained, step-level reward signals, our approach significantly enhances small models’ sensitivity to reward feedback during training. Experiments demonstrate that an 8B-parameter model achieves a 56.9% task success rate within only 180 training iterations—outperforming the GPT-5 baseline by 2.7× and establishing a new state of the art on the public leaderboard. Moreover, our method reduces computational and memory costs by 3.5× and 1.5×, respectively, while maintaining strong generalization across multiple out-of-domain tasks. The core contribution is the empirical validation that lightweight models, guided by process-aware reward shaping, can surpass LLMs in both efficiency and task performance.
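To make the idea concrete, here is a hypothetical sketch (not the paper's actual implementation) contrasting a sparse terminal reward with a dense process-level one. The constraint functions and the plan/trajectory representation are illustrative assumptions: each step's partial plan is credited with the fraction of constraints it satisfies, rather than a single pass/fail at the end.

```python
# Hypothetical sketch of process-level reward shaping (illustrative only).

def sparse_reward(trajectory, constraints):
    """Terminal reward: 1.0 iff the final plan satisfies every constraint."""
    final_plan = trajectory[-1]
    return 1.0 if all(c(final_plan) for c in constraints) else 0.0

def shaped_reward(trajectory, constraints):
    """Dense process-level reward: at each step, credit the fraction of
    constraints the partial plan satisfies, then average over all steps."""
    step_scores = [
        sum(c(plan) for c in constraints) / len(constraints)
        for plan in trajectory
    ]
    return sum(step_scores) / len(step_scores)

# Toy usage: two illustrative constraints on a travel "plan" dict.
constraints = [
    lambda p: p.get("budget_ok", False),
    lambda p: p.get("hotel_booked", False),
]
trajectory = [
    {"budget_ok": True, "hotel_booked": False},  # intermediate step
    {"budget_ok": True, "hotel_booked": True},   # final plan
]
print(sparse_reward(trajectory, constraints))  # 1.0
print(shaped_reward(trajectory, constraints))  # 0.75
```

The shaped variant rewards partial progress (here 0.75 instead of all-or-nothing), which is the kind of step-level signal the summary credits with making small models responsive during training.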
📝 Abstract
We investigated agentic RL with large language models on the TravelPlanner benchmark. Our approach, Planner-R1, achieved a **56.9%** final-pass rate with only 180 training queries, a 2.7× improvement over GPT-5's 21.2% baseline and the strongest agentic result on the public leaderboard. A central finding was that smaller models (8B) were highly responsive to reward shaping: with dense process-level signals, they reached competitive performance while being 3.5× more compute-efficient and 1.5× more memory-efficient than 32B models. Larger models were more robust under sparse rewards but exhibited smaller relative gains from shaping and higher variance across runs. While curriculum learning offered no significant benefit, shaped rewards consistently amplified learning dynamics, making 8B models the most efficient setting for agentic RL. Crucially, these gains did not come at the cost of overfitting: fine-tuned models mostly maintained or exceeded baseline performance on out-of-domain tasks, including Multi-IF, NaturalPlan, and τ-Bench. These results establish reward shaping as a decisive lever for scaling agentic RL, highlight the competitive strength of smaller models, and demonstrate that efficiency can be achieved without sacrificing generalization.