🤖 AI Summary
This work addresses a limitation of current reinforcement learning agents: they rely solely on sparse outcome-based rewards, which prevents effective evaluation of intermediate reasoning steps and hinders performance on complex tasks. To overcome this, the authors propose Agent-RRM, a multidimensional reward model that introduces a structured feedback mechanism tailored to the reasoning process, comprising explicit reasoning trajectories, focused critiques, and holistic scores. Three integration strategies are developed: text-augmented refinement (Reagent-C), reward-augmented guidance (Reagent-R), and unified feedback integration (Reagent-U). Extensive experiments across twelve benchmarks demonstrate that Reagent-U substantially enhances agent reasoning capabilities, achieving state-of-the-art accuracy of 43.7% on GAIA and 46.2% on WebWalkerQA, significantly outperforming existing approaches.
📝 Abstract
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce the Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance gains, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
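To make the structured-feedback idea concrete, here is a minimal sketch of how the three signals (reasoning trace, critique, score) might be packaged and combined with a sparse outcome reward in a reward-augmented setup. All names and the blending formula are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class StructuredFeedback:
    """Hypothetical container for the three feedback signals described above."""
    reasoning_trace: str   # explicit reasoning behind the judgment
    critique: str          # focused guidance highlighting reasoning flaws
    score: float           # holistic process score in [0, 1]

def blended_reward(outcome_reward: float, feedback: StructuredFeedback,
                   alpha: float = 0.5) -> float:
    """Illustrative reward-augmented signal (Reagent-R-style): mix the sparse
    outcome reward with the dense process score. The linear blend and the
    weight alpha are assumptions for illustration only."""
    return (1 - alpha) * outcome_reward + alpha * feedback.score

# A failed trajectory still receives partial credit for sound intermediate steps.
fb = StructuredFeedback(
    reasoning_trace="Step 2 used the wrong tool output.",
    critique="Verify the retrieved date before computing the interval.",
    score=0.4,
)
print(blended_reward(0.0, fb))  # 0.2: nonzero signal despite a failed outcome
```

The point of such a blend is that two trajectories with the same (failed) outcome can now receive different training signals depending on intermediate reasoning quality, which sparse outcome rewards alone cannot distinguish.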