🤖 AI Summary
This work addresses the absence of a systematic evaluation benchmark for process reward models (PRMs) tailored to tool-using agents. To bridge this gap, we introduce ToolPRMBench—the first PRM evaluation benchmark specifically designed for this setting—constructed from representative tool-use tasks and featuring step-level trajectory datasets that incorporate both single-step and multi-step errors via offline and online sampling strategies. To enhance data quality and mitigate annotation noise, we further propose a multi-LLM collaborative verification mechanism. Experimental results demonstrate that PRMs trained on ToolPRMBench significantly outperform general-purpose PRMs, thereby validating the benchmark’s effectiveness and practical utility. Our work establishes a reliable standard for future PRM development and evaluation in tool-augmented agent scenarios.
📝 Abstract
Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, these search methods utilize process reward models (PRMs) to provide step-level rewards, enabling more fine-grained monitoring. However, there is a lack of systematic and reliable evaluation benchmarks for PRMs in tool-using settings. In this paper, we introduce ToolPRMBench, a large-scale benchmark specifically designed to evaluate PRMs for tool-using agents. ToolPRMBench is built on top of several representative tool-using benchmarks and converts agent trajectories into step-level test cases. Each case contains the interaction history, a correct action, a plausible but incorrect alternative, and relevant tool metadata. We use offline sampling to isolate local single-step errors and online sampling to capture realistic multi-step failures from full agent rollouts. A multi-LLM verification pipeline is proposed to reduce label noise and ensure data quality. We conduct extensive experiments across large language models, general PRMs, and tool-specialized PRMs on ToolPRMBench. The results reveal clear differences in PRM effectiveness and highlight the potential of specialized PRMs for tool use. Code and data will be released at https://github.com/David-Li0406/ToolPRMBench.
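The abstract describes each test case as pairing an interaction history and tool metadata with a correct action and a plausible but incorrect alternative. A minimal sketch of what evaluating a PRM on such a case might look like is below; all field names, the `StepLevelCase` structure, and the toy PRM are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of one step-level test case, based only on the fields
# named in the abstract; all names here are assumptions, not the real schema.
@dataclass
class StepLevelCase:
    history: list          # prior interaction turns (user/agent/tool messages)
    correct_action: str    # the ground-truth next tool call
    incorrect_action: str  # a plausible but incorrect alternative
    tool_metadata: dict    # names/descriptions of the available tools

def prm_passes(prm, case: StepLevelCase) -> bool:
    """A PRM passes the case if it scores the correct action higher."""
    return (prm(case.history, case.correct_action, case.tool_metadata)
            > prm(case.history, case.incorrect_action, case.tool_metadata))

# Toy stand-in PRM: prefers actions that reference a known tool name.
def toy_prm(history, action, metadata):
    return 1.0 if any(name in action for name in metadata) else 0.0

case = StepLevelCase(
    history=["user: What's the weather in Paris?"],
    correct_action='search_weather(city="Paris")',
    incorrect_action='lookup_stock(ticker="PARIS")',
    tool_metadata={"search_weather": "Get current weather for a city"},
)
print(prm_passes(toy_prm, case))  # True: the correct action is ranked higher
```

Aggregating this pass/fail signal over many such cases would yield a step-level accuracy for a given PRM, which is the kind of comparison the paper's experiments report.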