When2Call: When (not) to Call Tools

📅 2025-04-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing tool-use benchmarks evaluate only whether tools are invoked correctly, neglecting the decision-making that precedes invocation: when to invoke a tool, when to ask a clarifying question, and when to abstain from answering. This work introduces When2Call, a benchmark designed to assess the timing and reasonableness of tool-invocation decisions. Methodologically, we propose a multiple-choice evaluation paradigm, synthesize high-quality training data, and adopt a preference-optimization fine-tuning strategy that outperforms standard supervised fine-tuning. Experiments reveal severe deficiencies in the decision-making of current state-of-the-art tool-use models; our approach achieves substantial gains on When2Call (+23.6% average accuracy). To foster reproducibility and community advancement, we fully open-source the benchmark, datasets, and evaluation code.
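The multiple-choice paradigm described above can be sketched as a small scoring harness. This is an illustrative assumption, not the paper's actual schema: the item fields, the toy weather tool, and the choice labels are all hypothetical, but the three candidate decisions mirror the ones the benchmark evaluates (call a tool, ask a follow-up, admit the tools don't suffice).

```python
from dataclasses import dataclass

@dataclass
class When2CallItem:
    """One multiple-choice item: a user query plus the tools the model sees.

    The candidate responses span the decision space the paper describes:
    emit a tool call, ask a follow-up question, or admit the question
    cannot be answered with the tools provided.
    """
    query: str
    tools: list    # JSON-style tool schemas visible to the model
    choices: dict  # label -> candidate response text
    gold: str      # label of the correct decision

def score(items, pick_choice):
    """Accuracy of a decision policy; `pick_choice(item)` returns a label."""
    correct = sum(pick_choice(it) == it.gold for it in items)
    return correct / len(items)

# A toy item where the right decision is to ask a follow-up question,
# because the required `city` argument is missing from the query.
item = When2CallItem(
    query="What's the weather?",
    tools=[{"name": "get_weather", "parameters": {"city": "string"}}],
    choices={
        "A": '{"tool": "get_weather", "arguments": {"city": "Paris"}}',
        "B": "Which city would you like the weather for?",
        "C": "I cannot answer that with the available tools.",
    },
    gold="B",
)

print(score([item], lambda it: "B"))  # a trivial policy that always asks
```

A real evaluation would replace the trivial policy with a model's selected (or highest-likelihood) choice; hallucinating the missing argument (choice A) is exactly the failure mode the benchmark penalizes.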

📝 Abstract
Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
Problem

Research questions and friction points this paper is trying to address.

- Evaluating when LMs should or should not call tools
- Assessing decision-making for tool calls, follow-ups, and admissions
- Improving tool-calling LMs with a new benchmark and training methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Develops the When2Call benchmark for tool-calling decisions
- Uses preference optimization for training improvement
- Evaluates when to call tools or admit limitations
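The preference-optimization regime leverages the benchmark's multiple-choice structure: the correct decision can serve as the preferred response and a wrong one as the rejected response. As a minimal sketch, here is a DPO-style pairwise loss on one such pair; the specific objective is an assumption for illustration, since the page only states that preference optimization outperforms supervised fine-tuning.

```python
import math

def preference_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    logp_* are the policy's summed log-probabilities of the chosen and
    rejected responses; ref_* are the same quantities under a frozen
    reference model. Minimizing this pushes the policy to prefer the
    chosen response (here: the correct tool-call decision, e.g. asking
    a follow-up) over the rejected one (e.g. a hallucinated tool call).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy already assigns a much higher log-probability to the
# correct decision than the reference does, the loss is small.
print(preference_loss(-5.0, -20.0, -10.0, -10.0))
```

In training, each multiple-choice item yields such pairs directly, which is what makes the benchmark's format convenient for preference optimization.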