🤖 AI Summary
Current large language models (LLMs) do not account for user context or preferences when invoking tools, which biases their selection among functionally overlapping tools and lowers user satisfaction. To address this, we introduce ToolSpectrum, the first benchmark for personalized tool calling, which formally defines a two-dimensional personalization paradigm grounded in “user profiling” and “environmental factors.” ToolSpectrum comprises multi-turn, real-world tasks and combines controlled-variable experiments, human annotation, and automated evaluation to enable fine-grained attribution analysis. Experimental results show that personalized tool invocation significantly improves user experience; however, state-of-the-art LLMs achieve only 61.8% average accuracy when reasoning jointly over both dimensions, revealing a critical capability gap. This work provides a benchmark, a formal framing, and an evaluation methodology for advancing personalized tool utilization in LLMs.
📝 Abstract
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection that follows user instructions and overlook context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at https://github.com/Chengziha0/ToolSpectrum.
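To make the two personalization dimensions concrete, the following is a minimal illustrative sketch of choosing among functionally overlapping tools using both a user profile and environmental factors. It is not ToolSpectrum's data format or scoring method: the `Tool` class, the `select_tool` function, the tag vocabulary, and the tag-overlap score are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool record; not the ToolSpectrum schema."""
    name: str
    capability: str                            # what the tool does
    profile_tags: frozenset = frozenset()      # user traits the tool suits
    environment_tags: frozenset = frozenset()  # conditions the tool suits


def select_tool(tools, capability, profile, environment):
    """Pick one tool among those that share the requested capability.

    Assumed scoring rule: count how many profile tags and environment
    tags of the tool match the current user and context.
    """
    candidates = [t for t in tools if t.capability == capability]
    if not candidates:
        raise ValueError(f"no tool offers capability {capability!r}")

    def score(tool):
        return (len(tool.profile_tags & profile)
                + len(tool.environment_tags & environment))

    return max(candidates, key=score)


# Two overlapping tools: both can book a ride.
TOOLS = [
    Tool("EconomyRide", "ride_hailing",
         frozenset({"budget_conscious"}), frozenset({"clear_weather"})),
    Tool("PremiumRide", "ride_hailing",
         frozenset({"business_traveler"}), frozenset({"heavy_rain", "late_night"})),
]

user = frozenset({"budget_conscious"})

# Same user, different environments, different choices.
sunny = select_tool(TOOLS, "ride_hailing", user, frozenset({"clear_weather"}))
storm = select_tool(TOOLS, "ride_hailing", user,
                    frozenset({"heavy_rain", "late_night"}))
print(sunny.name)  # EconomyRide
print(storm.name)  # PremiumRide
```

In this toy setup the same user receives different tools as the environment changes, so a selector that looked only at the profile would always return `EconomyRide`. This mirrors, in a simplified way, the failure mode the abstract describes, where one dimension is prioritized at the expense of the other.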