AI Summary
This work addresses the inefficacy of traditional caching in large language model (LLM) tool invocation, which stems from semantic heterogeneity, dynamic workloads, and varying data freshness. To overcome these challenges, the authors propose ToolCaching, a novel framework that introduces semantic awareness and dynamic value assessment into LLM tool caching. ToolCaching jointly leverages semantic and system-level features to determine request cacheability and employs a multi-armed bandit-based VAAC algorithm for adaptive admission control and multi-factor, value-driven cache eviction. Experimental results on both synthetic and public datasets demonstrate that ToolCaching achieves up to an 11% improvement in cache hit rate and reduces latency by 34% compared to baseline strategies, substantially enhancing the efficiency of tool invocation in LLM systems.
Abstract
Recent advances in Large Language Models (LLMs) have revolutionized web applications, enabling intelligent search, recommendation, and assistant services with natural language interfaces. Tool-calling extends LLMs with the ability to interact with external APIs, greatly enhancing their practical utility. While prior research has improved tool-calling performance by adopting traditional computer systems techniques, such as parallel and asynchronous execution, the challenge of redundant or repeated tool-calling requests remains largely unaddressed. Caching is a classic solution to this problem, but applying it to LLM tool-calling introduces new difficulties due to heterogeneous request semantics, dynamic workloads, and varying freshness requirements, which render conventional cache policies ineffective. To address these issues, we propose ToolCaching, an efficient feature-driven and adaptive caching framework for LLM tool-calling systems. ToolCaching systematically integrates semantic and system-level features to evaluate request cacheability and estimate caching value. At its core, the VAAC algorithm integrates bandit-based admission with value-driven, multi-factor eviction, jointly accounting for request frequency, recency, and caching value. Extensive experiments on synthetic and public tool-calling workloads demonstrate that ToolCaching with VAAC achieves up to 11% higher cache hit ratios and 34% lower latency compared to standard policies, effectively accelerating LLM tool-calling in practical applications.
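The abstract describes VAAC only at a high level: bandit-based admission plus eviction scored over request frequency, recency, and caching value. The paper's actual algorithm is not reproduced here; the following is a minimal sketch under assumed choices — an epsilon-greedy admission rule, a fixed value threshold, and hypothetical weights for the eviction score — purely to illustrate the structure of such a cache.

```python
import time
import random

class ToolCallCache:
    """Minimal sketch of a value-aware adaptive tool-call cache.

    Admission: epsilon-greedy — admit requests whose estimated caching
    value clears a threshold, occasionally explore low-value ones.
    Eviction: remove the entry with the lowest weighted score over
    caching value, access frequency, and recency.
    All thresholds and weights below are illustrative assumptions,
    not the paper's VAAC parameters.
    """

    def __init__(self, capacity=4, epsilon=0.1, value_threshold=0.5):
        self.capacity = capacity
        self.epsilon = epsilon                # assumed exploration rate
        self.value_threshold = value_threshold
        # key -> (result, caching_value, access_count, last_access_time)
        self.store = {}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None                       # cache miss
        result, value, freq, _ = entry
        # Update frequency and recency statistics on a hit.
        self.store[key] = (result, value, freq + 1, time.monotonic())
        return result

    def put(self, key, result, caching_value):
        # Bandit-style admission: reject low-value requests unless exploring.
        if caching_value < self.value_threshold and random.random() >= self.epsilon:
            return False
        if len(self.store) >= self.capacity:
            self._evict()
        self.store[key] = (result, caching_value, 1, time.monotonic())
        return True

    def _evict(self):
        now = time.monotonic()

        def score(item):
            _, (_, value, freq, last) = item
            recency = 1.0 / (1.0 + (now - last))  # newer entries score higher
            # Assumed multi-factor weighting over value, frequency, recency.
            return 0.4 * value + 0.3 * freq + 0.3 * recency

        victim_key = min(self.store.items(), key=score)[0]
        del self.store[victim_key]
```

With `epsilon=0.0` the admission rule becomes deterministic, which makes the gating behavior easy to see: a high-value result is cached, while a below-threshold one is rejected outright.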