🤖 AI Summary
This paper addresses the insufficient robustness of function calling (FC) in large language models (LLMs) operating as autonomous agents, systematically exposing performance degradation under natural query perturbations and toolset expansion. To this end, it introduces the first dual-dimensional benchmark for FC robustness, built on an extended Berkeley Function Calling Leaderboard (BFCL) that incorporates test sets for both semantically similar tool injection and natural-language query variation. Experiments reveal that state-of-the-art FC models suffer accuracy drops of over 35% under query variants and exhibit significantly higher mis-calling rates when the toolset expands, highlighting critical deployment risks. The analysis identifies a key limitation of existing evaluations: an overemphasis on accuracy that neglects stability and consistency. The authors advocate a shift from accuracy-centric to reliability-centric evaluation, providing both theoretical grounding and an actionable benchmark for designing and deploying robust FC agents.
📝 Abstract
Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating the best-performing FC models on a carefully expanded subset of the Berkeley Function Calling Leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies and highlight areas for improvement in real-world agentic deployments.
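The two robustness dimensions described above can be sketched as a small evaluation harness. The code below is a hypothetical illustration, not the paper's actual benchmark: `call_model` is a trivial keyword-matching stub standing in for a real FC model, and all tool names and queries are invented. It measures accuracy across paraphrased queries (dimension 1) and the mis-call rate once semantically related distractor tools are injected into the toolset (dimension 2).

```python
# Hypothetical sketch of the dual-dimensional FC robustness evaluation.
# `call_model` is a stub in place of a real function-calling model, so the
# harness runs end-to-end; its naive matching also exhibits the failure
# modes the benchmark is designed to expose.

def call_model(query: str, tools: list[str]) -> str:
    """Stub FC model: returns the first tool whose name prefix appears in
    the query, falling back to the first tool (a deliberate failure mode)."""
    for tool in tools:
        if tool.split("_")[0] in query.lower():
            return tool
    return tools[0]

def robustness_eval(cases, extra_tools):
    """Each case: (paraphrased queries, base toolset, expected tool call).
    Returns (accuracy over paraphrases, mis-call rate after injecting
    semantically similar distractor tools)."""
    correct = total = miscalls = 0
    for queries, tools, expected in cases:
        for q in queries:
            total += 1
            # Dimension 1: stability under natural query variation.
            if call_model(q, tools) == expected:
                correct += 1
            # Dimension 2: same query, expanded toolset with distractors.
            if call_model(q, tools + extra_tools) != expected:
                miscalls += 1
    return correct / total, miscalls / total

# Invented test cases: each query pair paraphrases the same intent.
cases = [
    (["book a flight to Paris", "I need plane tickets to Paris"],
     ["flight_search", "hotel_search"], "flight_search"),
    (["find a hotel in Rome", "somewhere to stay in Rome"],
     ["flight_search", "hotel_search"], "hotel_search"),
]
acc, miscall_rate = robustness_eval(cases, extra_tools=["stay_finder"])
print(acc, miscall_rate)  # accuracy drops on paraphrases; distractors add mis-calls
```

Even this toy stub shows the reported pattern: paraphrases that avoid a tool's keywords lower accuracy, and a semantically related distractor (`stay_finder`) captures a query its base tool should have handled.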