To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the problem that large language models often make redundant or even harmful calls to external tools, such as web search, because they cannot accurately judge when a call is needed. The authors propose a lightweight, decision-theoretic framework that evaluates tool-use decisions along three factors: necessity, utility, and affordability. They quantify the misalignment between the true need for and utility of a tool call, inferred normatively from an optimal allocation of calls, and the model's self-perceived need and utility, inferred descriptively from its observed behavior. Necessity and utility estimators trained on the model's hidden states then drive simple controllers that make better tool-invocation decisions. Across three tasks and six models, these controllers improve decision quality and yield stronger task performance than the models' native, self-perceived decisions.
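To make the controller idea concrete, here is a minimal sketch of how two probes over hidden states could gate a tool call. Everything in it is an illustrative assumption rather than the authors' implementation: the `LinearProbe` class, the `should_call_tool` function, the logistic form of the estimators, and the specific rule (call only when need is likely and the expected benefit exceeds a cost) are hypothetical, and the probe weights would in practice be learned from labeled data.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def sigmoid(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class LinearProbe:
    """Hypothetical linear probe mapping a hidden-state vector to a probability."""
    weights: Sequence[float]
    bias: float = 0.0

    def predict(self, hidden: Sequence[float]) -> float:
        z = self.bias + sum(w * h for w, h in zip(self.weights, hidden))
        return sigmoid(z)


def should_call_tool(
    hidden: Sequence[float],
    need_probe: LinearProbe,
    utility_probe: LinearProbe,
    cost: float,
    need_threshold: float = 0.5,
) -> bool:
    """Assumed decision rule, not taken from the paper.

    Call the tool only if the model probably needs it AND the
    expected benefit (p_need * p_utility) outweighs the call's cost.
    """
    p_need = need_probe.predict(hidden)
    p_utility = utility_probe.predict(hidden)
    return p_need >= need_threshold and p_need * p_utility > cost


# Toy usage with hand-set weights (a real probe would be trained).
need_probe = LinearProbe(weights=[1.0, 0.0])
utility_probe = LinearProbe(weights=[0.0, 1.0])
print(should_call_tool([2.0, 2.0], need_probe, utility_probe, cost=0.3))
print(should_call_tool([-2.0, 2.0], need_probe, utility_probe, cost=0.3))
```

The `cost` parameter stands in for the affordability factor: raising it makes the controller more conservative about spending tool calls.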

📝 Abstract

Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some calls may be redundant or even harmful. Effective tool use, therefore, hinges on a core LLM decision: whether or not to call a tool when performing a task. This decision is particularly challenging for web search tools, where the benefits of external information depend on the model's internal knowledge and its ability to integrate potentially noisy tool responses. We introduce a principled framework inspired by decision-making theory to evaluate web search tool-use decisions along three key factors: necessity, utility, and affordability. Our analysis combines two complementary lenses: a normative perspective that infers true need and utility from an optimal allocation of tool calls, and a descriptive perspective that infers the model's self-perceived need and utility from its observed behavior. We find that models' perceived need and utility of tool calls are often misaligned with their true need and utility. Building on this framework, we train lightweight estimators of need and utility based on models' hidden states. Our estimators enable simple controllers that can improve decision quality and lead to stronger task performance than the self-perceived setup across three tasks and six models.
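The contrast between the normative and descriptive lenses can be illustrated with a small sketch. The definitions below are assumptions made for illustration, not the paper's: true need is taken to mean "the model is wrong without the tool", true utility is the change in correctness when the tool is used, and the optimal allocation spends a fixed call budget on the examples where the tool helps. The `Example` fields and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    correct_without_tool: bool  # outcome if the model answers alone
    correct_with_tool: bool     # outcome if the model uses web search
    model_called_tool: bool     # descriptive: what the model actually chose


def true_need(ex: Example) -> bool:
    """Assumed definition: the tool is needed if the model fails alone."""
    return not ex.correct_without_tool


def true_utility(ex: Example) -> int:
    """Assumed definition: +1 if the tool helps, 0 if neutral, -1 if harmful."""
    return int(ex.correct_with_tool) - int(ex.correct_without_tool)


def optimal_allocation(examples: List[Example], budget: int) -> List[bool]:
    """Normative view: spend at most `budget` calls, only where they help."""
    ranked = sorted(
        range(len(examples)),
        key=lambda i: true_utility(examples[i]),
        reverse=True,
    )
    calls = [False] * len(examples)
    for i in ranked[:budget]:
        if true_utility(examples[i]) > 0:
            calls[i] = True
    return calls


def accuracy(examples: List[Example], calls: List[bool]) -> float:
    """Task accuracy under a given call/no-call decision per example."""
    hits = sum(
        ex.correct_with_tool if call else ex.correct_without_tool
        for ex, call in zip(examples, calls)
    )
    return hits / len(examples)


def misalignment_rate(examples: List[Example], calls: List[bool]) -> float:
    """Share of examples where the model's own choice differs from `calls`."""
    diffs = sum(ex.model_called_tool != call for ex, call in zip(examples, calls))
    return diffs / len(examples)


examples = [
    Example(False, True, False),   # tool needed and helpful, but model skipped it
    Example(True, True, True),     # redundant call
    Example(True, False, True),    # harmful call
    Example(False, False, False),  # needed, but the tool does not help
]
best = optimal_allocation(examples, budget=2)
own = [ex.model_called_tool for ex in examples]
print(best, accuracy(examples, best), accuracy(examples, own))
print(misalignment_rate(examples, best))
```

On this toy data the model's own decisions include one redundant and one harmful call and miss the one helpful call, which is the kind of gap between perceived and true need that the abstract describes.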

Problem

Research questions and friction points this paper is trying to address.

tool calling

LLM decision-making

web search

necessity and utility

agentic AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

tool calling

decision-making framework

large language models