One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

📅 2025-10-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
The tool-learning community has long lacked a dedicated reward model (RM) tailored to function-calling tasks, hindering agent performance improvements. To address this, we propose ToolRM, a lightweight generative reward model, and introduce ToolPref-Pairwise-30K, the first large-scale, general-purpose preference dataset for tool invocation. We design a novel data-generation pipeline integrating multi-dimensional sampling and rule-based scoring to significantly enhance RM generalization across diverse critique tasks. ToolRM is trained on the Qwen3 family of models and supports inference-time optimization strategies such as Best-of-N sampling and self-correction. Experiments demonstrate that ToolRM achieves up to a 14.28% accuracy gain on TRBench_BFCL, outperforming Claude 4 and OpenAI o3; on ACEBench, it reduces output token count by over 66%, validating its efficiency and scalability in reasoning-intensive scenarios.
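The rule-based scoring idea in the pipeline above can be sketched as follows. This is a minimal illustrative example, not the paper's actual pipeline: the scoring rules, JSON fields (`name`, `arguments`), and function names are all assumptions, standing in for whatever verifiable checks the authors use to rank sampled tool calls into (chosen, rejected) preference pairs.

```python
# Hypothetical sketch: turning sampled tool calls into pairwise preference
# data via rule-based scoring. Field names and rules are illustrative only.
from itertools import combinations

def score_tool_call(call: dict, gold: dict) -> int:
    """Rule-based score: reward a correct function name and matching arguments."""
    score = 0
    if call.get("name") == gold.get("name"):
        score += 2  # correct function selection weighs most
    for k, v in gold.get("arguments", {}).items():
        if call.get("arguments", {}).get(k) == v:
            score += 1  # one point per correctly filled argument
    return score

def build_preference_pairs(samples: list[dict], gold: dict) -> list[tuple[dict, dict]]:
    """Form (chosen, rejected) pairs from every pair of samples with a score gap."""
    scored = [(score_tool_call(s, gold), s) for s in samples]
    pairs = []
    for (sa, a), (sb, b) in combinations(scored, 2):
        if sa > sb:
            pairs.append((a, b))
        elif sb > sa:
            pairs.append((b, a))
    return pairs

gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
samples = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}},
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "fahrenheit"}},
    {"name": "get_time", "arguments": {"city": "Paris"}},
]
pairs = build_preference_pairs(samples, gold)
print(len(pairs))  # scores are 4, 3, 1, so all three pairs are orderable → 3
```

Ties (equal scores) produce no pair, which keeps the resulting dataset unambiguous for preference training.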

๐Ÿ“ Abstract
Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
Problem

Research questions and friction points this paper is trying to address.

Develops lightweight reward models for agentic tool-use scenarios
Creates diverse pairwise preference dataset for reinforcement learning
Evaluates tool-use reward models with the specialized benchmark TRBench_BFCL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight generative reward models for tool-use
Rule-based scoring and sampling for preference data
Generalizes to critique tasks and reduces tokens