🤖 AI Summary
Existing reward models struggle to reliably evaluate LLMs' tool-calling capabilities because they are trained primarily on natural language outputs rather than executable outcomes. Method: We propose a task-outcome-based reward modeling framework: (1) introducing FC-RewardBench, the first benchmark explicitly designed to assess reward models in tool-calling scenarios; (2) synthesizing high-quality training data from permissively licensed, open-weight LLMs; (3) designing a reward-guided data filtering mechanism; and (4) performing supervised fine-tuning across model scales (1.7B–14B). Crucially, our approach replaces supervision over linguistic outputs with actual tool-execution results as the primary reward signal. Contribution/Results: Our method significantly improves cross-domain generalization, outperforming general-purpose baselines by up to 25% on average across seven out-of-domain benchmarks. It enables efficient data curation and lightweight model adaptation, establishing a trustworthy paradigm for evaluating tool-augmented LLMs.
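The core idea of an outcome-based reward signal can be sketched as follows: instead of judging the surface text of a predicted tool call, execute it and compare the result against a reference outcome. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, the dictionary-based tool registry, and the binary reward are all assumptions.

```python
def outcome_reward(predicted_call: dict, tools: dict, expected_result) -> float:
    """Score a predicted tool call by executing it and comparing the
    execution result to the expected outcome.

    `tools` maps tool names to callables; `predicted_call` follows a
    {"name": ..., "arguments": {...}} shape. All names here are
    illustrative assumptions, not the paper's API.
    """
    fn = tools.get(predicted_call.get("name"))
    if fn is None:
        return 0.0  # hallucinated or unknown tool name
    try:
        result = fn(**predicted_call.get("arguments", {}))
    except TypeError:
        return 0.0  # malformed or missing arguments
    # Binary outcome-based reward: 1 if execution matches the reference.
    return 1.0 if result == expected_result else 0.0


# Usage: a toy weather tool and a predicted call.
tools = {"get_weather": lambda city: {"Paris": "sunny"}.get(city, "unknown")}
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
print(outcome_reward(call, tools, "sunny"))  # 1.0
```

The design choice here mirrors the summary's point: a call with a plausible-looking but wrong tool name or argument set scores zero regardless of how fluent the surrounding text is.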
📝 Abstract
As large language models (LLMs) increasingly interact with external tools, reward modeling for tool use has become a critical yet underexplored area. Existing reward models, trained primarily on natural language outputs, struggle to evaluate tool-based reasoning and execution. To quantify this gap, we introduce FC-RewardBench, the first benchmark designed to systematically assess reward models' performance in tool-calling scenarios. Our analysis shows that current reward models often miss key signals of effective tool use, highlighting the need for domain-specific modeling. To address this, we propose a training framework for outcome-based reward models using data synthesized from permissively licensed, open-weight LLMs. We train models ranging from 1.7B to 14B parameters and evaluate them across seven out-of-domain benchmarks. These models consistently outperform general-purpose baselines, achieving up to 25% average improvement in downstream task performance and enabling data-efficient fine-tuning through reward-guided filtering.
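The reward-guided filtering mentioned above can be sketched simply: score each candidate training sample with the reward model and keep only those above a threshold, so fine-tuning sees a smaller, higher-quality set. This is a hedged sketch under assumed names; `reward_fn` stands in for the trained reward model and the threshold value is illustrative.

```python
def reward_filter(samples: list, reward_fn, threshold: float = 0.5) -> list:
    """Reward-guided data filtering: keep only training samples whose
    reward-model score meets the threshold. `reward_fn` and `threshold`
    are illustrative stand-ins for the paper's trained model and cutoff.
    """
    return [s for s in samples if reward_fn(s) >= threshold]


# Usage with a stand-in scoring function (a real pipeline would call
# the trained outcome-based reward model here).
samples = [
    {"call": "get_weather(city='Paris')", "score": 0.9},
    {"call": "get_weather(town='Paris')", "score": 0.2},
]
kept = reward_filter(samples, lambda s: s["score"])
print(len(kept))  # 1
```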