AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with multi-step decision-making tasks (e.g., web navigation, e-commerce shopping) because they lack fine-grained process-level feedback and because the quality of individual actions is hard to evaluate. Method: This paper proposes a process reward model (PRM) for agent tasks that introduces *step-wise promise* and *progress* assessments to jointly model the interdependence of sequential decisions and their advancement toward the goal; it employs temporal-difference (TD) learning combined with Generalized Advantage Estimation (GAE) to generate high-quality training labels efficiently. Contribution/Results: Experiments demonstrate over 8× better compute efficiency than baselines, robust performance gains when scaling test-time compute, and a significantly improved exploration-exploitation trade-off in LLM-based agents. The PRM establishes a scalable, process-aware supervision paradigm for training LLM agents, enabling precise step-level reward shaping without dense human annotation.
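To make the labeling scheme concrete, below is a minimal sketch of TD-based label generation with GAE, as described in the summary. The `rewards` and `values` inputs, the function name, and the discount parameters are hypothetical placeholders for quantities estimated from sampled rollouts; the paper's exact formulation may differ.

```python
from typing import List

def gae_labels(rewards: List[float], values: List[float],
               gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Turn per-step rewards and value estimates into step-level
    advantage labels: A_t = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Sweep backward so each step reuses the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a four-step trajectory with only a sparse terminal reward.
print(gae_labels(rewards=[0.0, 0.0, 0.0, 1.0],
                 values=[0.2, 0.3, 0.5, 0.8]))
```

One reason TD-style estimation can be more sample-efficient than Monte Carlo alternatives is that a single rollout yields labels for every step, rather than requiring fresh rollouts from each intermediate state.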

📝 Abstract
Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
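As an illustration of how a PRM can guide the agent's decisions at test time, the sketch below reranks sampled candidate actions by their PRM score, so spending more test-time compute (more candidates) buys better decisions. The `policy` and `prm` callables and all names here are hypothetical stand-ins, not the paper's actual API; best-of-N reranking is one plausible way to realize the guidance the abstract describes.

```python
import random
from typing import Callable

def select_action(state: str,
                  policy: Callable[[str], str],
                  prm: Callable[[str, str], float],
                  n_candidates: int = 8) -> str:
    """Best-of-N action selection: sample candidate actions from the
    policy, score each with the process reward model, keep the best."""
    candidates = [policy(state) for _ in range(n_candidates)]
    scored = [(prm(state, action), action) for action in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy usage with dummy stand-ins; a real agent would query an LLM policy
# and a trained PRM here.
dummy_policy = lambda s: random.choice(["search[red shoes]", "click[item 3]", "back"])
dummy_prm = lambda s, a: random.random()
print(select_action("page: search results for 'shoes'", dummy_policy, dummy_prm))
```

Raising `n_candidates` is one way to scale test-time compute; the abstract reports robust improvement under such scaling.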
Problem

Research questions and friction points this paper is trying to address.

Improving multi-turn decision-making in LLM agents through process rewards
Evaluating agent actions based on goal proximity and progress
Developing scalable reward models for efficient agent training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process reward models evaluate sequential agent decisions
Temporal difference estimation enables scalable training data generation
Step-wise progress tracking balances exploration and exploitation
👥 Authors
Zhiheng Xi (Fudan University) · LLM Reasoning, LLM-based Agents
Chenyang Liao (Fudan University)
Guanyu Li (Fudan University)
Yajie Yang (Fudan University)
Wenxiang Chen (Fudan University) · LLM Reasoning, LLM-based Agents
Zhihao Zhang (Fudan University)
Binghai Wang (Fudan University)
Senjie Jin (Fudan University) · Natural Language Processing
Yuhao Zhou (Fudan University)
Jian Guan (Ant Group)
Wei Wu (Ant Group)
Tao Ji (Renmin University of China)
Tao Gui (Fudan University)
Qi Zhang (Fudan University)
Xuanjing Huang (Fudan University)