🤖 AI Summary
This work addresses a limitation of current reinforcement learning agents: they rely solely on sparse outcome-based rewards, which prevents effective evaluation of intermediate reasoning steps and hinders performance on complex tasks. To overcome this, the authors propose Agent-RRM, a multidimensional reward model that introduces a structured feedback mechanism tailored to the reasoning process, comprising explicit reasoning trajectories, focused critiques, and holistic scores. Three integration strategies are developed: text-augmented refinement (Reagent-C), reward-augmented guidance (Reagent-R), and unified feedback integration (Reagent-U). Extensive experiments across twelve benchmarks demonstrate that Reagent-U substantially enhances agent reasoning capabilities, achieving state-of-the-art accuracy of 43.7% on GAIA and 46.2% on WebWalkerQA, significantly outperforming existing approaches.
📝 Abstract
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance gains, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
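To make the structured-feedback idea concrete, here is a minimal sketch of how the three signals (reasoning trace, critique, score) might be packaged and combined with a sparse outcome reward in a reward-augmented setup. All names and the blending formula are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class StructuredFeedback:
    """Hypothetical container for the three feedback signals described above."""
    reasoning_trace: str   # explicit reasoning behind the judgment
    critique: str          # focused guidance highlighting reasoning flaws
    score: float           # holistic process score in [0, 1]

def blended_reward(outcome_reward: float, feedback: StructuredFeedback,
                   alpha: float = 0.5) -> float:
    """Illustrative reward-augmented signal (Reagent-R-style): mix the sparse
    outcome reward with the dense process score. The linear blend and the
    weight alpha are assumptions for illustration only."""
    return (1 - alpha) * outcome_reward + alpha * feedback.score

# A failed trajectory still receives partial credit for sound intermediate steps.
fb = StructuredFeedback(
    reasoning_trace="Step 2 used the wrong tool output.",
    critique="Verify the retrieved date before computing the interval.",
    score=0.4,
)
print(blended_reward(0.0, fb))  # 0.2: nonzero signal despite a failed outcome
```

The point of such a blend is that two trajectories with the same (failed) outcome can now receive different training signals depending on intermediate reasoning quality, which sparse outcome rewards alone cannot distinguish.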