🤖 AI Summary
This work addresses the trade-off between the high computational overhead of large language models (LLMs) and the weak performance of small models in agentic reinforcement learning. We propose a process-oriented reward-shaping method that enables efficient policy optimization on the TravelPlanner benchmark—without auxiliary mechanisms such as curriculum learning. By providing fine-grained, step-level reward signals, our approach significantly enhances small models’ sensitivity to reward feedback during training. Experiments demonstrate that an 8B-parameter model achieves a 56.9% task success rate within only 180 training iterations—outperforming the GPT-5 baseline by 2.7× and establishing a new state of the art on the public leaderboard. Moreover, our method reduces computational and memory costs by 3.5× and 1.5×, respectively, while maintaining strong generalization across multiple out-of-domain tasks. The core contribution is the empirical validation that lightweight models, guided by process-aware reward shaping, can surpass LLMs in both efficiency and task performance.
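To make the idea concrete, here is a hypothetical sketch (not the paper's actual implementation) contrasting a sparse terminal reward with a dense process-level one. The constraint functions and the plan/trajectory representation are illustrative assumptions: each step's partial plan is credited with the fraction of constraints it satisfies, rather than a single pass/fail at the end.

```python
# Hypothetical sketch of process-level reward shaping (illustrative only).

def sparse_reward(trajectory, constraints):
    """Terminal reward: 1.0 iff the final plan satisfies every constraint."""
    final_plan = trajectory[-1]
    return 1.0 if all(c(final_plan) for c in constraints) else 0.0

def shaped_reward(trajectory, constraints):
    """Dense process-level reward: at each step, credit the fraction of
    constraints the partial plan satisfies, then average over all steps."""
    step_scores = [
        sum(c(plan) for c in constraints) / len(constraints)
        for plan in trajectory
    ]
    return sum(step_scores) / len(step_scores)

# Toy usage: two illustrative constraints on a travel "plan" dict.
constraints = [
    lambda p: p.get("budget_ok", False),
    lambda p: p.get("hotel_booked", False),
]
trajectory = [
    {"budget_ok": True, "hotel_booked": False},  # intermediate step
    {"budget_ok": True, "hotel_booked": True},   # final plan
]
print(sparse_reward(trajectory, constraints))  # 1.0
print(shaped_reward(trajectory, constraints))  # 0.75
```

The shaped variant rewards partial progress (here 0.75 instead of all-or-nothing), which is the kind of step-level signal the summary credits with making small models responsive during training.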
📝 Abstract
We investigated agentic RL with large language models on the TravelPlanner benchmark. Our approach, Planner-R1, achieved a **56.9%** final-pass rate with only 180 training queries, a 2.7× improvement over GPT-5's 21.2% baseline and the strongest agentic result on the public leaderboard. A central finding was that smaller models (8B) were highly responsive to reward shaping: with dense process-level signals, they reached competitive performance while being 3.5× more compute-efficient and 1.5× more memory-efficient than 32B models. Larger models were more robust under sparse rewards but exhibited smaller relative gains from shaping and higher variance across runs. While curriculum learning offered no significant benefit, shaped rewards consistently amplified learning dynamics, making 8B models the most efficient setting for agentic RL. Crucially, these gains did not come at the cost of overfitting: fine-tuned models mostly maintained or exceeded baseline performance on out-of-domain tasks, including Multi-IF, NaturalPlan, and τ-Bench. These results establish reward shaping as a decisive lever for scaling agentic RL, highlight the competitive strength of smaller models, and demonstrate that efficiency can be achieved without sacrificing generalization.